Using TensorSharp from C#

TensorSharp is a real .NET library, not just a binary. Reference the NuGet packages and drive inference directly from your own code — useful when you want inference inside an existing service without an HTTP hop.

NuGet packages

The repository is split along package boundaries so consumers depend on only the layers they need.

Package	Namespace	Responsibility
`TensorSharp.Core`	`TensorSharp`	Tensor primitives, ops, allocators, storage, device abstraction.
`TensorSharp.Runtime`	`TensorSharp.Runtime`	GGUF parsing, tokenizers, prompt rendering, sampling, paged KV cache, continuous-batching scheduler.
`TensorSharp.Models`	`TensorSharp.Models`	`ModelBase`, architecture implementations, multimodal encoders, batched/paged forward passes.
`TensorSharp.Backends.GGML`	`TensorSharp.GGML`	GGML-backed execution and native interop.
`TensorSharp.Backends.Cuda`	`TensorSharp.Cuda`	Direct CUDA allocator, storage, cuBLAS GEMM, PTX kernels, quantized CUDA ops.
`TensorSharp.Backends.MLX`	`TensorSharp.MLX`	Apple-Silicon MLX backend (mlx-c / Metal).
`TensorSharp.Server`	`TensorSharp.Server`	ASP.NET Core server, OpenAI/Ollama adapters, inference engine host, web UI.
`TensorSharp.Cli`	`TensorSharp.Cli`	Console host and debugging / batch tooling.

For typical in-process use (hosting inference inside your own application), reference TensorSharp.Models (which pulls in Core and Runtime) plus the backend package for your hardware (e.g. TensorSharp.Backends.GGML):

dotnet add package TensorSharp.Models
dotnet add package TensorSharp.Backends.GGML

Note: The NuGet packages are managed-only; the native GGML/CUDA/MLX libraries are not embedded. Build them once (see Backends) or use a self-contained platform binary, and make sure the native library is discoverable at run time.
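One way to confirm that a native library is discoverable before loading a model is to ask the .NET runtime to resolve it. The sketch below uses the standard System.Runtime.InteropServices.NativeLibrary API and is independent of TensorSharp. The class name NativeProbe is hypothetical, and the library name you pass in is an assumption: use the actual file name produced by your backend build.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeProbe
{
    // Returns true if the runtime can resolve and load the given native library.
    // The name goes through the same probing rules as [DllImport]
    // (platform-specific prefixes and suffixes such as "lib" and ".so" are tried).
    public static bool CanLoad(string libraryName)
    {
        if (NativeLibrary.TryLoad(libraryName, typeof(NativeProbe).Assembly, null, out IntPtr handle))
        {
            // We only wanted to know whether it resolves, so release it again.
            NativeLibrary.Free(handle);
            return true;
        }
        return false;
    }
}
```

Calling NativeProbe.CanLoad("ggml") at startup (where "ggml" stands in for your real library name) lets you fail fast with a clear message instead of hitting a DllNotFoundException in the middle of model loading.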

Minimal example — load, generate, decode

Every model is loaded the same way: ModelBase.Create() reads the GGUF metadata and instantiates the right architecture. From there you tokenize, run forward passes, sample, and decode.

using System;
using System.Collections.Generic;
using System.Linq;
using TensorSharp.Models;
using TensorSharp.Runtime;

// 1. Load any GGUF model — the architecture is auto-detected from metadata.
using var model = ModelBase.Create("Qwen3-4B-Q8_0.gguf", BackendType.GgmlCuda);

// 2. Configure sampling. Library defaults match Ollama (temp 0.8, top_k 40, top_p 0.9); this example lowers the temperature to 0.7.
var sampling = new SamplingConfig { Temperature = 0.7f, TopP = 0.9f, TopK = 40 };

// 3. Tokenize the prompt.
var tokens = model.Tokenizer
    .Encode("Explain mixture-of-experts in one sentence.", addSpecial: true)
    .ToList();
var generated = new List<int>();

// 4. Autoregressive decode loop.
for (int step = 0; step < 200; step++)
{
    float[] logits = model.Forward(tokens.ToArray());     // logits for the next token
    int next = model.Sample(logits, sampling, generated); // applies penalties + sampling
    if (model.Tokenizer.IsEos(next)) break;
    generated.Add(next);
    tokens.Add(next);
}

// 5. Detokenize the result.
Console.WriteLine(model.Tokenizer.Decode(generated));

For greedy/deterministic decoding, call model.SampleGreedy(logits) instead of Sample.
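Greedy decoding means always taking the token with the highest logit. The following is a minimal standalone sketch of that idea, not TensorSharp's own SampleGreedy implementation; the GreedySketch class and ArgMax method are hypothetical names used only for illustration.

```csharp
using System;

public static class GreedySketch
{
    // Returns the index of the largest logit; ties resolve to the lowest index.
    public static int ArgMax(float[] logits)
    {
        if (logits == null || logits.Length == 0)
            throw new ArgumentException("logits must contain at least one value", nameof(logits));

        int best = 0;
        for (int i = 1; i < logits.Length; i++)
        {
            if (logits[i] > logits[best])
                best = i;
        }
        return best;
    }
}
```

Because the choice depends only on the logits, the same prompt and model produce the same output on every run, which is what makes greedy decoding useful for tests and regression checks.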

Smoke-test variant

A one-shot sanity check that loads a model, runs a single forward pass, and prints the top token:

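using System;
using System.Linq;
using TensorSharp.Models;
using TensorSharp.Runtime;

string modelPath = "Qwen3-4B-Q8_0.gguf";        // placeholder: path to your GGUF file
BackendType backend = BackendType.GgmlCpu;      // placeholder: pick the backend you built
using var model = ModelBase.Create(modelPath, backend);
var tokenIds = model.Tokenizer.Encode("Hello", addSpecial: true);
float[] logits = model.Forward(tokenIds.ToArray());
int topToken = model.SampleGreedy(logits);      // most likely next token
Console.WriteLine($"vocab={model.Config.VocabSize}, tokens={tokenIds.Count}, topToken={topToken}");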

SamplingConfig

The sampling knobs map one-to-one to the CLI flags and API options. Defaults match Ollama.

Property	Type	Default	Meaning
`Temperature`	float	0.8	Randomness; 0 = greedy/deterministic.
`TopK`	int	40	Limit to the top-K most probable tokens; 0 = disabled.
`TopP`	float	0.9	Nucleus sampling; 1.0 = disabled.
`MinP`	float	0	Drops tokens whose probability is below MinP × the top token's probability; 0 = disabled.
`RepetitionPenalty`	float	1.1	Multiplicative penalty; >1 discourages repetition.
`PresencePenalty`	float	0	Additive penalty for tokens already present.
`FrequencyPenalty`	float	0	Additive penalty proportional to token frequency.
`Seed`	int	-1	Reproducible sampling; -1 = time-based.
`StopSequences`	List<string>	null	Stop when any of these strings is produced.
`MaxTokens`	int	0	Maximum tokens to generate; 0 = use the caller's default.
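To make the interaction of these knobs concrete, here is a standalone sketch of a common sampling pipeline: repetition penalty, then temperature, top-K, min-P, top-P, and finally a random draw. It is an illustration under stated assumptions, not TensorSharp's implementation: the stage order, the CTRL-style form of the repetition penalty, and the SamplingSketch name are all assumptions, and the presence and frequency penalties are left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SamplingSketch
{
    public static int Sample(float[] logits, IReadOnlyCollection<int> history,
        float temperature, int topK, float topP, float minP,
        float repetitionPenalty, Random rng)
    {
        var scores = (float[])logits.Clone(); // never mutate the caller's logits

        // Repetition penalty: make already-generated tokens less attractive.
        foreach (int id in history.Distinct())
            scores[id] = scores[id] > 0 ? scores[id] / repetitionPenalty
                                        : scores[id] * repetitionPenalty;

        // Temperature 0 degenerates to greedy decoding.
        if (temperature <= 0f)
            return Array.IndexOf(scores, scores.Max());

        // Temperature-scaled softmax (max subtracted for numerical stability),
        // sorted from most to least probable.
        float max = scores.Max();
        var candidates = scores
            .Select((s, id) => (Id: id, P: Math.Exp((s - max) / temperature)))
            .OrderByDescending(c => c.P)
            .ToList();
        double total = candidates.Sum(c => c.P);
        candidates = candidates.Select(c => (c.Id, P: c.P / total)).ToList();

        // Top-K: keep only the K most probable tokens (0 = disabled).
        if (topK > 0 && topK < candidates.Count)
            candidates = candidates.Take(topK).ToList();

        // Min-P: drop tokens below minP times the best token's probability.
        if (minP > 0f)
        {
            double floor = minP * candidates[0].P;
            candidates = candidates.Where(c => c.P >= floor).ToList();
        }

        // Top-P: keep the smallest prefix whose cumulative mass reaches topP.
        if (topP < 1f)
        {
            double cumulative = 0;
            int keep = 0;
            while (keep < candidates.Count && cumulative < topP)
                cumulative += candidates[keep++].P;
            candidates = candidates.Take(Math.Max(1, keep)).ToList();
        }

        // Draw from the survivors, weighted by their remaining probability mass.
        double r = rng.NextDouble() * candidates.Sum(c => c.P);
        foreach (var c in candidates)
        {
            r -= c.P;
            if (r <= 0) return c.Id;
        }
        return candidates[candidates.Count - 1].Id;
    }
}
```

Passing new Random(seed) gives reproducible draws, which mirrors the role of the Seed property; a seed of -1 would correspond to constructing Random without an argument.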

Key types & interfaces

Type	Role
`ModelBase`	Abstract base for every architecture. `Create(path, backend)`, `Forward(int[])`, `Sample(...)`, `SampleGreedy(...)`, plus `Config` and `Tokenizer`.
`BackendType`	Enum: `Cpu`, `GgmlCpu`, `GgmlMetal`, `GgmlCuda`, `Cuda`, `Mlx`.
`SamplingConfig`	Sampling configuration (table above).
`ITokenizer`	`Encode(text, addSpecial)`, `Decode(ids)`, `IsEos(id)`, `EosTokenIds` (BPE & SentencePiece implementations).
`ModelConfig`	Architecture metadata: `VocabSize`, context length, and more.
`IBatchedPagedModel`	Optional batched/paged forward (`ForwardBatch`) implemented by most architectures for continuous batching.
`InferenceEngine`	Worker-thread scheduler + paged block pool that powers the server's continuous batching (in `TensorSharp.Runtime.Scheduling`).

Other runtime contracts worth knowing: IModelArchitecture, IPromptRenderer, IOutputProtocolParser, IMultimodalInjector, IKvBlockCodec (with the built-in TurboQuantKvCodec), and IKVCachePolicy.

Tip: For most applications the easiest integration is to run TensorSharp.Server and call it over the OpenAI-compatible API: you keep your app process clean and get continuous batching for free. Reach for the library API when you need in-process control or custom decoding.
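As a rough sketch of the server route, the client below posts a chat request using the standard OpenAI chat-completions shape (POST /v1/chat/completions, reply in choices[0].message.content). The base URL, the model name, and the ChatClientSketch class are assumptions for illustration; check the HTTP API page for the routes and options TensorSharp.Server actually exposes.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ChatClientSketch
{
    // Builds an OpenAI-style chat completion request body with a single user message.
    public static string BuildRequestJson(string model, string userPrompt, float temperature)
    {
        var payload = new
        {
            model,
            messages = new[] { new { role = "user", content = userPrompt } },
            temperature
        };
        return JsonSerializer.Serialize(payload);
    }

    // Sends the request and returns the assistant's reply text.
    // baseUrl is whatever address your server instance listens on (an assumption here).
    public static async Task<string> CompleteAsync(HttpClient http, string baseUrl, string model, string prompt)
    {
        using var content = new StringContent(
            BuildRequestJson(model, prompt, 0.8f), Encoding.UTF8, "application/json");
        using var response = await http.PostAsync(
            $"{baseUrl.TrimEnd('/')}/v1/chat/completions", content);
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString() ?? string.Empty;
    }
}
```

A call might look like await ChatClientSketch.CompleteAsync(new HttpClient(), "http://localhost:5000", "Qwen3-4B-Q8_0", "Hello"), where both the address and the model name are placeholders. Keeping the request builder separate from the network call makes the payload easy to unit-test without a running server.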
