Using TensorSharp from C#

TensorSharp is a real .NET library, not just a binary. Reference the NuGet packages and drive inference directly from your own code — useful when you want inference inside an existing service without an HTTP hop.

NuGet packages

The repository is split along package boundaries so consumers depend on only the layers they need.

| Package | Namespace | Responsibility |
|---|---|---|
| TensorSharp.Core | TensorSharp | Tensor primitives, ops, allocators, storage, device abstraction. |
| TensorSharp.Runtime | TensorSharp.Runtime | GGUF parsing, tokenizers, prompt rendering, sampling, paged KV cache, continuous-batching scheduler. |
| TensorSharp.Models | TensorSharp.Models | ModelBase, architecture implementations, multimodal encoders, batched/paged forward passes. |
| TensorSharp.Backends.GGML | TensorSharp.GGML | GGML-backed execution and native interop. |
| TensorSharp.Backends.Cuda | TensorSharp.Cuda | Direct CUDA allocator, storage, cuBLAS GEMM, PTX kernels, quantized CUDA ops. |
| TensorSharp.Backends.MLX | TensorSharp.MLX | Apple-Silicon MLX backend (mlx-c / Metal). |
| TensorSharp.Server | TensorSharp.Server | ASP.NET Core server, OpenAI/Ollama adapters, inference engine host, web UI. |
| TensorSharp.Cli | TensorSharp.Cli | Console host and debugging / batch tooling. |

To embed inference in your own process, you typically reference TensorSharp.Models (which pulls in Core + Runtime) plus the backend package for your hardware (e.g. TensorSharp.Backends.GGML).

dotnet add package TensorSharp.Models
dotnet add package TensorSharp.Backends.GGML

The NuGet packages are managed-only — the native GGML/CUDA/MLX libraries are not embedded. Build them once (see Backends) or use a self-contained platform binary, and make sure the native library is discoverable at run time.
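Because a missing native library usually surfaces only when the first model loads, it can help to probe for it at startup. The sketch below uses .NET's NativeLibrary API for that. The library names are placeholders (assumptions, not names taken from TensorSharp); the real file names depend on how you built or obtained the backend.

```csharp
using System;
using System.Reflection;
using System.Runtime.InteropServices;

// Returns true when the runtime can resolve `name` using the same probing
// rules it applies to DllImport (platform prefixes/suffixes, app directory).
static bool CanLoadNative(string name)
{
    if (NativeLibrary.TryLoad(name, Assembly.GetExecutingAssembly(), null, out IntPtr handle))
    {
        NativeLibrary.Free(handle); // we only wanted to know whether it resolves
        return true;
    }
    return false;
}

// Hypothetical library names: replace with the files your backend build produced.
string[] candidates = { "ggml", "ggml-cuda" };
foreach (var name in candidates)
    Console.WriteLine($"{name}: {(CanLoadNative(name) ? "found" : "NOT found")}");
```

If a library is reported as not found, the usual remedies are copying it next to the application binaries or adding its directory to the platform's library search path.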

Minimal example — load, generate, decode

Every model is loaded the same way: ModelBase.Create() reads the GGUF metadata and instantiates the right architecture. From there you tokenize, run forward passes, sample, and decode.

using System;
using System.Collections.Generic;
using System.Linq;
using TensorSharp.Models;
using TensorSharp.Runtime;

// 1. Load any GGUF model — the architecture is auto-detected from metadata.
using var model = ModelBase.Create("Qwen3-4B-Q8_0.gguf", BackendType.GgmlCuda);

// 2. Configure sampling (defaults match Ollama: temp 0.8, top_k 40, top_p 0.9).
var sampling = new SamplingConfig { Temperature = 0.7f, TopP = 0.9f, TopK = 40 };

// 3. Tokenize the prompt.
var tokens = model.Tokenizer
    .Encode("Explain mixture-of-experts in one sentence.", addSpecial: true)
    .ToList();
var generated = new List<int>();

// 4. Autoregressive decode loop.
for (int step = 0; step < 200; step++)
{
    float[] logits = model.Forward(tokens.ToArray());     // logits for the next token
    int next = model.Sample(logits, sampling, generated); // applies penalties + sampling
    if (model.Tokenizer.IsEos(next)) break;
    generated.Add(next);
    tokens.Add(next);
}

// 5. Detokenize the result.
Console.WriteLine(model.Tokenizer.Decode(generated));
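
The loop above stops only on EOS or after 200 steps. The source does not say whether Sample enforces SamplingConfig.StopSequences in this hand-written path, so a custom loop may want its own check. A minimal sketch, assuming you decode the generated tokens to text after each step; TruncateAtStop is a hypothetical helper, not part of TensorSharp:

```csharp
using System;
using System.Collections.Generic;

// Cuts `text` at the earliest occurrence of any stop string and reports
// whether a stop string was found.
static (string Text, bool Stopped) TruncateAtStop(string text, IReadOnlyList<string> stops)
{
    int cut = -1;
    foreach (var stop in stops)
    {
        if (string.IsNullOrEmpty(stop)) continue;
        int idx = text.IndexOf(stop, StringComparison.Ordinal);
        if (idx >= 0 && (cut < 0 || idx < cut)) cut = idx; // keep the earliest match
    }
    return cut >= 0 ? (text.Substring(0, cut), true) : (text, false);
}

var (kept, stopped) = TruncateAtStop("Paris.\nUser: next question", new[] { "\nUser:", "###" });
Console.WriteLine($"stopped={stopped}, kept=\"{kept}\"");
```

Inside the decode loop you would call this on model.Tokenizer.Decode(generated) and break when the second tuple element is true.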

For greedy/deterministic decoding, call model.SampleGreedy(logits) instead of Sample.
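
Conceptually, greedy decoding picks the highest-scoring token at every step. The sketch below shows that idea in plain C#; it illustrates the technique and makes no claim about how SampleGreedy is implemented internally (tie-breaking in particular is an assumption).

```csharp
using System;

// Index of the largest logit; ties resolve to the lowest index in this sketch.
static int ArgMax(float[] logits)
{
    if (logits.Length == 0) throw new ArgumentException("logits must not be empty");
    int best = 0;
    for (int i = 1; i < logits.Length; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

float[] demo = { -1.2f, 3.4f, 0.5f, 3.4f };
Console.WriteLine(ArgMax(demo));
```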

Smoke-test variant

A one-shot sanity check that loads a model, runs a single forward pass, and prints the top token:

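// modelPath (string) and backend (BackendType) are supplied by the caller;
// the using directives are the same as in the minimal example above.
using var model = ModelBase.Create(modelPath, backend);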
var tokenIds = model.Tokenizer.Encode("Hello", addSpecial: true);
float[] logits = model.Forward(tokenIds.ToArray());

int topToken = model.SampleGreedy(logits);
Console.WriteLine($"vocab={model.Config.VocabSize}, tokens={tokenIds.Count}, topToken={topToken}");

SamplingConfig

The sampling knobs map one-to-one to the CLI flags and API options. Defaults match Ollama.

| Property | Type | Default | Meaning |
|---|---|---|---|
| Temperature | float | 0.8 | Randomness; 0 = greedy/deterministic. |
| TopK | int | 40 | Limit to the top-K most probable tokens; 0 = disabled. |
| TopP | float | 0.9 | Nucleus sampling; 1.0 = disabled. |
| MinP | float | 0 | Minimum probability threshold relative to the max. |
| RepetitionPenalty | float | 1.1 | Multiplicative penalty; >1 discourages repetition. |
| PresencePenalty | float | 0 | Additive penalty for tokens already present. |
| FrequencyPenalty | float | 0 | Additive penalty proportional to token frequency. |
| Seed | int | -1 | Reproducible sampling; -1 = time-based. |
| StopSequences | List<string> | null | Stop when any of these strings is produced. |
| MaxTokens | int | 0 | Maximum tokens to generate; 0 = use the caller's default. |
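
To make the knobs concrete, here is a plain C# sketch of one common way these settings are combined: penalties first, then temperature, top-K, min-P, top-P, and a seeded draw. ApplyPenalties and SampleToken are hypothetical helpers written for illustration. The order of the filters and the exact penalty formula are assumptions borrowed from common practice, not TensorSharp's confirmed behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Penalize tokens that already appear in the generated history.
static float[] ApplyPenalties(float[] logits, IEnumerable<int> history,
                              float repetition, float presence, float frequency)
{
    var adjusted = (float[])logits.Clone();
    foreach (var group in history.GroupBy(t => t))
    {
        int tokenId = group.Key;
        int count = group.Count();
        float l = adjusted[tokenId];
        l = l > 0 ? l / repetition : l * repetition;          // multiplicative part
        adjusted[tokenId] = l - presence - frequency * count; // additive parts
    }
    return adjusted;
}

static int SampleToken(float[] logits, float temperature, int topK,
                       float topP, float minP, Random rng)
{
    float max = logits.Max();
    if (temperature <= 0f) return Array.IndexOf(logits, max); // 0 = greedy

    // Softmax over temperature-scaled logits, sorted by descending probability.
    var cands = logits
        .Select((l, id) => (Id: id, P: Math.Exp((l - max) / temperature)))
        .OrderByDescending(c => c.P)
        .ToList();
    double sum = cands.Sum(c => c.P);
    cands = cands.Select(c => (c.Id, P: c.P / sum)).ToList();

    // Top-K: keep the K most probable tokens (0 = disabled).
    if (topK > 0 && topK < cands.Count) cands = cands.Take(topK).ToList();

    // Min-P: drop tokens below minP times the top probability.
    if (minP > 0f)
    {
        double floor = minP * cands[0].P;
        cands = cands.Where(c => c.P >= floor).ToList();
    }

    // Top-P: keep the smallest prefix whose cumulative probability reaches topP.
    if (topP < 1f)
    {
        double cum = 0;
        int keep = 0;
        while (keep < cands.Count && cum < topP) cum += cands[keep++].P;
        cands = cands.Take(Math.Max(keep, 1)).ToList();
    }

    // Draw proportionally to the remaining (unnormalized) probabilities.
    double r = rng.NextDouble() * cands.Sum(c => c.P);
    foreach (var cand in cands)
    {
        r -= cand.P;
        if (r <= 0) return cand.Id;
    }
    return cands[cands.Count - 1].Id;
}

float[] demoLogits = { 2.0f, 1.0f, 0.1f, -3.0f };
int seed = 42;                                         // -1 would mean "not reproducible"
var rng = seed >= 0 ? new Random(seed) : new Random();
var penalized = ApplyPenalties(demoLogits, new[] { 0 }, 1.1f, 0f, 0f);
Console.WriteLine(SampleToken(penalized, 0.8f, 40, 0.9f, 0f, rng));
```

In this sketch the probabilities are not renormalized between filters, so the top-P cutoff is measured against the full distribution; some implementations renormalize after each stage instead.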

Key types & interfaces

| Type | Role |
|---|---|
| ModelBase | Abstract base for every architecture. Create(path, backend), Forward(int[]), Sample(...), SampleGreedy(...), plus Config and Tokenizer. |
| BackendType | Enum: Cpu, GgmlCpu, GgmlMetal, GgmlCuda, Cuda, Mlx. |
| SamplingConfig | Sampling configuration (table above). |
| ITokenizer | Encode(text, addSpecial), Decode(ids), IsEos(id), EosTokenIds (BPE & SentencePiece implementations). |
| ModelConfig | Architecture metadata: VocabSize, context length, and more. |
| IBatchedPagedModel | Optional batched/paged forward (ForwardBatch) implemented by most architectures for continuous batching. |
| InferenceEngine | Worker-thread scheduler + paged block pool that powers the server's continuous batching (in TensorSharp.Runtime.Scheduling). |

Other runtime contracts worth knowing: IModelArchitecture, IPromptRenderer, IOutputProtocolParser, IMultimodalInjector, IKvBlockCodec (with the built-in TurboQuantKvCodec), and IKVCachePolicy.

For most applications the easiest integration is to run TensorSharp.Server and call it over the OpenAI-compatible API — you keep your app process clean and get continuous batching for free. Reach for the library API when you need in-process control or custom decoding.
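
As a rough illustration of the server route, the sketch below builds an OpenAI-style chat completion request with HttpClient types. The base address and port, the model name, and the /v1/chat/completions path are assumptions based on the OpenAI API convention; check the address and routes of your own TensorSharp.Server deployment before relying on them.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Builds (but does not send) an OpenAI-style chat completion request.
static HttpRequestMessage BuildChatRequest(Uri baseUri, string model, string prompt)
{
    var payload = new
    {
        model,
        messages = new[] { new { role = "user", content = prompt } },
        temperature = 0.7,
        max_tokens = 200
    };
    string json = JsonSerializer.Serialize(payload);
    return new HttpRequestMessage(HttpMethod.Post, new Uri(baseUri, "/v1/chat/completions"))
    {
        Content = new StringContent(json, Encoding.UTF8, "application/json")
    };
}

// Hypothetical endpoint and model name.
var request = BuildChatRequest(new Uri("http://localhost:8080"), "Qwen3-4B-Q8_0",
                               "Explain mixture-of-experts in one sentence.");
Console.WriteLine(request.RequestUri);
Console.WriteLine(await request.Content.ReadAsStringAsync());

// To actually send it:
//   using var http = new HttpClient();
//   using var response = await http.SendAsync(request);
//   Console.WriteLine(await response.Content.ReadAsStringAsync());
```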