Compute Backends

A backend is selected with --backend (CLI and server) and decides which hardware runs the math. Every backend falls back to CPU for any unimplemented op, so output stays correct everywhere — backends differ only in speed.

Which backend should I use?

| Your hardware | Recommended | Flag | Notes |
|---|---|---|---|
| Apple Silicon (Mac) | GGML Metal | ggml_metal | Default on macOS. mlx is an alternative Apple-Silicon GPU path. |
| Windows / Linux + NVIDIA | GGML CUDA | ggml_cuda | Most-tested NVIDIA path. cuda is the direct PTX/cuBLAS backend for experimentation. |
| No GPU / portability / debugging | Pure C# CPU | cpu | No native dependencies. For faster CPU inference use ggml_cpu (native kernels). |
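
The recommendations above can be turned into a small helper for launch scripts. The following is only a sketch: the function name recommend_backend is hypothetical, and treating "nvidia-smi is on PATH" as "an NVIDIA GPU is present" is a heuristic assumed here, not something the project does.

```shell
# Hypothetical helper: prints the recommended --backend value for a host.
# Arguments (all optional, mainly for testing): OS name, CPU arch, yes/no for NVIDIA.
recommend_backend() {
  os="${1:-$(uname -s)}"
  arch="${2:-$(uname -m)}"
  if [ "$#" -ge 3 ]; then
    has_nvidia="$3"
  elif command -v nvidia-smi >/dev/null 2>&1; then
    has_nvidia=yes   # heuristic: the NVIDIA driver tools are installed
  else
    has_nvidia=no
  fi

  case "$os/$arch" in
    Darwin/arm64)
      echo ggml_metal ;;          # Apple Silicon: default on macOS
    *)
      if [ "$has_nvidia" = yes ]; then
        echo ggml_cuda            # most-tested NVIDIA path
      else
        echo cpu                  # pure C# CPU, no native dependencies
      fi ;;
  esac
}
```

Pass the printed value to --backend. The helper deliberately never picks mlx, cuda, or ggml_cpu; those remain manual choices.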

All backends in detail

| Backend | Flag | Best fit |
|---|---|---|
| Direct CUDA / cuBLAS | cuda | NVIDIA inference & experimentation |
| MLX Metal | mlx | Apple Silicon (alternative to GGML Metal) |
| GGML Metal | ggml_metal | Apple Silicon (default on macOS) |
| GGML CUDA | ggml_cuda | NVIDIA inference through ggml |
| GGML CPU | ggml_cpu | Native CPU kernels |
| Pure C# CPU | cpu | Portability & debugging |

GGML Metal & GGML CUDA

GGML Metal and GGML CUDA are the most-tested GPU paths. Both are built on a native C++ bridge that links ggml.

Both run native quantized matmul (Q4_K_M, Q8_0, …) without dequantizing to FP32, plus a native paged-attention kernel that drives ggml_flash_attn_ext.

MLX Metal

--backend mlx is a GPU-accelerated Apple-Silicon path built on mlx-c. It implements quantized ops (Q4_K_M, Q8_0, Q5_K, Q6_K, IQ2_XXS, IQ4_XS, IQ4_NL, MXFP4, …) without dequantizing to FP32, fused decode/prefill Metal kernels, compiled-graph kernels, async worker dispatch, batched MoE decode, and MoE expert offload. It pins the GGUF mmap in physical RAM via mlock(2) and derives allocator caps from the host's unified-memory capacity. Requires libmlxc (built locally or located via TENSORSHARP_MLX_LIBRARY / TENSORSHARP_MLX_LIBRARY_DIR).

Direct CUDA

--backend cuda is a pure-C# path using the CUDA Driver API, cuBLAS GEMM, and PTX kernels for common float32 ops (fill, unary/binary/ternary, activations, RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat) plus native quantized matmul/get-rows for supported quant types. Unsupported ops route through CPU fallbacks while preserving tensor semantics. It is also the pure-C# backend where MTP speculative decoding is profitable.

CPU backends

The server reports which backends are actually available on the host in GET /api/models (supportedBackends). If a CUDA or MLX backend is missing, the host did not detect a usable driver/runtime at startup.
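
A script can check this before choosing a backend. The sketch below assumes that supportedBackends is a flat JSON array of backend flag strings somewhere in the response body; the exact response shape, the helper name has_backend, and the TENSORSHARP_URL variable are assumptions made for illustration.

```shell
# Hypothetical helper: reads a GET /api/models response body on stdin and
# succeeds (exit 0) if the named backend is listed in supportedBackends.
# Assumes supportedBackends is a flat JSON array of strings.
has_backend() {
  tr -d '\n' \
    | grep -o '"supportedBackends"[[:space:]]*:[[:space:]]*\[[^]]*]' \
    | grep -q "\"$1\""
}

# Example use (TENSORSHARP_URL is a placeholder for your server's base URL):
#   curl -s "$TENSORSHARP_URL/api/models" | has_backend ggml_cuda \
#     && echo "ggml_cuda is available"
```

The match includes the surrounding quotes, so asking for cuda does not accidentally match ggml_cuda. A proper JSON tool such as jq is the more robust choice where it is installed.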

Building the native libraries

The native GGML library is built automatically on the first dotnet build. To build it manually or with CUDA:

cd TensorSharp.GGML.Native

# macOS (Metal)
bash build-macos.sh

# Linux (CPU only, then with CUDA)
bash build-linux.sh
bash build-linux.sh --cuda

# Windows, PowerShell (CPU only, then with CUDA)
.\build-windows.ps1 --no-cuda
.\build-windows.ps1 --cuda

On Windows/Linux the script auto-detects the compute capability of the visible NVIDIA GPU and passes a narrow CMAKE_CUDA_ARCHITECTURES value (e.g. 86-real on an RTX 3080), which cuts CUDA build time substantially. To override the detected value, set the environment variable or pass --cuda-arch:

TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real' bash build-linux.sh --cuda
bash build-linux.sh --cuda --cuda-arch='86-real;89-real'
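
If you are unsure which values to pass, the list can be derived from the GPUs in the machine. This is a sketch of the mapping only, not the build script's actual detection logic; the function name caps_to_cmake_archs is hypothetical, and it assumes a driver recent enough that nvidia-smi supports the compute_cap query field.

```shell
# Hypothetical helper: reads compute capabilities (one per line, e.g. "8.6")
# on stdin and prints a CMAKE_CUDA_ARCHITECTURES-style list such as
# "86-real;89-real". Duplicates are collapsed.
caps_to_cmake_archs() {
  sort -u | awk '
    NF {
      gsub(/\./, "")                      # "8.6" -> "86"
      printf "%s%s-real", sep, $1
      sep = ";"
    }
    END { print "" }
  '
}

# Example use on a machine with NVIDIA GPUs:
#   archs="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | caps_to_cmake_archs)"
#   bash build-linux.sh --cuda --cuda-arch="$archs"
```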

You can also request CUDA from dotnet build directly:

TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release

MLX native library (macOS only)

The MLX backend depends on libmlxc. A helper script fetches and builds it:

bash TensorSharp.Backends.MLX/build-native-macos.sh

It writes the libraries into TensorSharp.Backends.MLX/Native/dist/. At run time the backend probes the application directory first; point it elsewhere with TENSORSHARP_MLX_LIBRARY or TENSORSHARP_MLX_LIBRARY_DIR.
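
When running from a checkout rather than a published application directory, one option is to point the backend at the script's output directory. The sketch below assumes that TENSORSHARP_MLX_LIBRARY_DIR takes a directory path and that the built libraries carry the usual macOS .dylib extension; the function name use_mlx_dist is hypothetical.

```shell
# Hypothetical helper: exports TENSORSHARP_MLX_LIBRARY_DIR if the given
# directory (default: the build script's dist/ output) contains .dylib files.
use_mlx_dist() {
  dir="${1:-TensorSharp.Backends.MLX/Native/dist}"
  if ls "$dir"/*.dylib >/dev/null 2>&1; then
    # Store an absolute path so the setting survives a later cd.
    export TENSORSHARP_MLX_LIBRARY_DIR="$(cd "$dir" && pwd)"
  else
    echo "no .dylib files in $dir; run build-native-macos.sh first" >&2
    return 1
  fi
}

# Example use from the repository root:
#   use_mlx_dist && echo "MLX libraries: $TENSORSHARP_MLX_LIBRARY_DIR"
```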

Pre-built platform binaries

Pushing a v* tag builds self-contained archives that bundle the .NET 10 runtime and the platform's native libraries — they run without a separate .NET install or native build:

| Archive | Native backend(s) bundled | Format |
|---|---|---|
| win-x64-cpu | GGML CPU | .zip |
| win-x64-cuda | GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime | .zip |
| linux-x64-cpu | GGML CPU | .tar.gz |
| linux-x64-cuda | GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime | .tar.gz |
| osx-arm64 | GGML Metal + MLX | .tar.gz |

The -cuda archives still require an NVIDIA GPU and a compatible driver at run time; the macOS archive requires Apple Silicon.