Compute Backends

A backend is selected with --backend (CLI and server) and decides which hardware runs the math. Every backend falls back to CPU for any unimplemented op, so output stays correct everywhere — backends differ only in speed.

Which backend should I use?

Your hardware	Recommended	Flag	Notes
Apple Silicon (Mac)	GGML Metal	`ggml_metal`	Default on macOS. `mlx` is an alternative Apple-Silicon GPU path.
Windows / Linux + NVIDIA	GGML CUDA	`ggml_cuda`	Most-tested NVIDIA path. `cuda` is the direct PTX/cuBLAS backend for experimentation.
No GPU / portability / debugging	Pure C# CPU	`cpu`	No native dependencies. For faster CPU inference use `ggml_cpu` (native kernels).

All backends in detail

Backend	Flag	Best fit
Direct CUDA / cuBLAS	`cuda`	NVIDIA inference & experimentation
MLX Metal	`mlx`	Apple Silicon (alternative to GGML Metal)
GGML Metal	`ggml_metal`	Apple Silicon (default on macOS)
GGML CUDA	`ggml_cuda`	NVIDIA inference through ggml
GGML CPU	`ggml_cpu`	Native CPU kernels
Pure C# CPU	`cpu`	Portability & debugging

GGML Metal & GGML CUDA

The most-tested GPU paths, built on a native C++ bridge that links ggml.

GGML Metal (ggml_metal, macOS) — quantized weights are mapped zero-copy from the GGUF file into Metal command buffers via host-pointer buffers, so the resident set stays close to the on-disk model size.
GGML CUDA (ggml_cuda, Windows/Linux + NVIDIA) — quantized weights are uploaded to device memory once at load time and the host copy is released afterwards.

Both run native quantized matmul (Q4_K_M, Q8_0, …) without dequantizing to FP32, plus a native paged-attention kernel that drives ggml_flash_attn_ext.

MLX Metal

--backend mlx is a GPU-accelerated Apple-Silicon path built on mlx-c. It implements quantized ops (Q4_K_M, Q8_0, Q5_K, Q6_K, IQ2_XXS, IQ4_XS, IQ4_NL, MXFP4, …) without dequantizing to FP32, fused decode/prefill Metal kernels, compiled-graph kernels, async worker dispatch, batched MoE decode, and MoE expert offload. It pins the GGUF mmap in physical RAM via mlock(2) and derives allocator caps from the host's unified-memory capacity. Requires libmlxc (built locally or located via TENSORSHARP_MLX_LIBRARY / TENSORSHARP_MLX_LIBRARY_DIR).

Direct CUDA

--backend cuda is a pure-C# path using the CUDA Driver API, cuBLAS GEMM, and PTX kernels for common float32 ops (fill, unary/binary/ternary, activations, RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat) plus native quantized matmul/get-rows for supported quant types. Unsupported ops route through CPU fallbacks while preserving tensor semantics. It is also the pure-C# backend where MTP speculative decoding is profitable.

CPU backends

Pure C# CPU (cpu) — portable inference with no native dependencies, using managed GEMM fast paths and fused SIMD kernels. Ideal for debugging and maximum portability.
GGML CPU (ggml_cpu) — native GGML CPU kernels with quantized weights mapped zero-copy from the GGUF file. Faster than pure C# on most CPUs.

🔎

The server reports which backends are actually available on the host in GET /api/models (supportedBackends). If a CUDA or MLX backend is missing, the host did not detect a usable driver/runtime at startup.

Building the native libraries

The native GGML library is built automatically on the first dotnet build. To build it manually or with CUDA:

cd TensorSharp.GGML.Native

# macOS (Metal)
bash build-macos.sh

# Linux — CPU only / with CUDA
bash build-linux.sh
bash build-linux.sh --cuda

# Windows — CPU only / with CUDA
.\build-windows.ps1 --no-cuda
.\build-windows.ps1 --cuda

On Windows/Linux the script auto-detects the visible NVIDIA GPU compute capability and passes a narrow CMAKE_CUDA_ARCHITECTURES (e.g. 86-real on an RTX 3080), which cuts CUDA build time substantially. Override it explicitly:

TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real' bash build-linux.sh --cuda
bash build-linux.sh --cuda --cuda-arch='86-real;89-real'

You can also request CUDA from dotnet build directly:

TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release

MLX native library (macOS only)

The MLX backend depends on libmlxc. A helper script fetches and builds it:

bash TensorSharp.Backends.MLX/build-native-macos.sh

It writes the libraries into TensorSharp.Backends.MLX/Native/dist/. At run time the backend probes the application directory first; point it elsewhere with TENSORSHARP_MLX_LIBRARY or TENSORSHARP_MLX_LIBRARY_DIR.

Pre-built platform binaries

Pushing a v* tag builds self-contained archives that bundle the .NET 10 runtime and the platform's native libraries — they run without a separate .NET install or native build:

Archive	Native backend(s) bundled	Format
`win-x64-cpu`	GGML CPU	.zip
`win-x64-cuda`	GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime	.zip
`linux-x64-cpu`	GGML CPU	.tar.gz
`linux-x64-cuda`	GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime	.tar.gz
`osx-arm64`	GGML Metal + MLX	.tar.gz

The -cuda archives still require an NVIDIA GPU and a compatible driver at run time; the macOS archive requires Apple Silicon.

← Getting Started Next: Supported Models →