Advanced Features

The systems that make TensorSharp fast and scalable: continuous batching with a paged KV cache, speculative decoding, text diffusion, and the kernel- and memory-level optimizations underneath.

Continuous batching & paged KV cache

The server's InferenceEngine is a vLLM-style continuous-batching engine, on by default. Instead of running one request at a time, it interleaves many at the granularity of a single decode step.
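To make "interleaving at the granularity of a single decode step" concrete, here is a minimal toy scheduler in Python. It is only a sketch of the general continuous-batching idea, not TensorSharp's InferenceEngine: the `Request` class, `run_continuous_batching`, and the `max_batch` parameter are hypothetical names invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request that needs `tokens_needed` decode steps to finish."""
    name: str
    tokens_needed: int
    generated: list = field(default_factory=list)


def run_continuous_batching(requests, max_batch=2):
    """Toy step-level scheduler: free slots are refilled before every decode step."""
    waiting = deque(requests)
    running = []
    finished = []  # (request name, step at which it completed)
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as a slot is free,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running sequence produces exactly one token.
        for req in running:
            req.generated.append(f"{req.name}{len(req.generated)}")
        step += 1
        # Retire finished sequences immediately so their slot can be reused.
        for req in running:
            if len(req.generated) == req.tokens_needed:
                finished.append((req.name, step))
        running = [r for r in running if len(r.generated) < r.tokens_needed]
    return finished, step


order, steps = run_continuous_batching(
    [Request("a", 1), Request("b", 4), Request("c", 2)], max_batch=2
)
print(order, steps)
```

In this toy run, request "c" takes over the slot freed by "a" after the first step, so all three requests finish in 4 steps. Running the batch ["a", "b"] to completion before starting "c" would take 6.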

Models that have not implemented the batched path still run on the engine's isolated per-sequence KV-swap fallback. Tune the scheduler with the TS_SCHED_* variables, or disable continuous batching entirely with --no-continuous-batching.

Native paged attention

The native kernel TSGgml_PagedAttentionForward (and a WithSinks variant for GPT OSS) gathers K/V from the paged buffer in C++, builds a small GGML graph per sequence, and dispatches ggml_flash_attn_ext — the same fused Metal/CUDA flash-attention kernel the single-sequence path uses. On a long-context Ministral-3-14B workload (4×~800 tokens) it runs ~21% faster than the legacy per-sequence GGML path.
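The "gather K/V from the paged buffer" step can be pictured with a small pure-Python model. This is a sketch of the general paged-cache idea under assumed names (`PagedKVCache`, `BLOCK_SIZE`, a per-sequence block table); the real kernel does this in C++ on tensors, and its actual data layout is not described here.

```python
BLOCK_SIZE = 4  # tokens per page; an illustrative value, not TensorSharp's setting


class PagedKVCache:
    """Toy paged cache: one shared pool of pages plus a block table per sequence."""

    def __init__(self, num_blocks):
        # Physical pool: num_blocks pages, each holding BLOCK_SIZE token slots.
        self.pool = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        """Scatter one token's K/V entry into the sequence's current page."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # No page yet, or the last one is full: take any free page.
            # (Raises IndexError when the pool is exhausted.)
            table.append(self.free_blocks.pop(0))
        self.pool[table[length // BLOCK_SIZE]][length % BLOCK_SIZE] = kv
        self.lengths[seq_id] = length + 1

    def gather(self, seq_id):
        """Rebuild the sequence's contiguous K/V view by following its block table."""
        table = self.block_tables[seq_id]
        return [
            self.pool[table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
            for i in range(self.lengths[seq_id])
        ]


cache = PagedKVCache(num_blocks=4)
for t in range(5):
    # Two sequences grow in lockstep, so their pages interleave in the pool.
    cache.append("A", ("A", t))
    cache.append("B", ("B", t))
print(cache.block_tables)
print(cache.gather("A"))
```

Because pages are handed out on demand, the two sequences end up with non-contiguous pages, yet `gather` still returns each sequence's entries in order. Attention can then run on the gathered view exactly as it would on a contiguous cache.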

Batched forward passes pack N sequences into one ForwardBatch call, with one batched linear-projection matmul per layer and a paged K/V scatter. They are implemented for Mistral 3, Gemma 4, GPT OSS, Qwen 3.5/3.6 (with a GatedDeltaNet recurrent-state pool), and Nemotron-H (with a Mamba2 recurrent-state pool). Gemma 4 reaches ~1.5–1.6× legacy throughput; Nemotron-H with batched Mamba2 reaches ~3.95× at batch=3 on an Apple M4 Pro.

📖 Full deep dive: docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md in the repository.

MTP / NextN speculative decoding

Some architectures ship a multi-token-prediction (MTP / NextN) draft head that lets the server run lossless speculative decoding for solo (non-concurrent) sequences. The draft proposes several future tokens cheaply, the trunk verifies all of them in one batched forward, and accepted tokens are committed in a single step.

🎯 Because the request's own sampler (temperature, top-k/p, and all penalties) drives both the draft and the verify pass, the output is identical to standard decode. Speculation changes only how many forward passes it takes to produce the same tokens.
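The draft/verify/commit cycle can be sketched with two toy "models" in Python. This covers only the greedy case and is not TensorSharp's implementation: `target_next`, `draft_next`, and `speculative_decode` are hypothetical stand-ins, and the per-position verification loop stands in for what a real engine does in one batched forward.

```python
def target_next(ctx):
    """Hypothetical trunk model: a deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_next(ctx):
    """Hypothetical draft head: agrees with the trunk unless len(ctx) % 3 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 7 if len(ctx) % 3 == 0 else guess


def standard_decode(prompt, n_new):
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


def speculative_decode(prompt, n_new, max_draft=4):
    ctx = list(prompt)
    verify_passes = 0
    while len(ctx) - len(prompt) < n_new:
        # 1) Draft: propose max_draft tokens autoregressively with the cheap head.
        draft = []
        for _ in range(max_draft):
            draft.append(draft_next(ctx + draft))
        # 2) Verify: the trunk checks every drafted position. A real engine does
        #    this in one batched forward, so it is counted as a single pass here.
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(ctx + accepted)
            if tok != expected:
                accepted.append(expected)  # the trunk's token replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft matched: the verify pass yields one extra trunk token.
            accepted.append(target_next(ctx + accepted))
        # 3) Commit all accepted tokens in a single step.
        ctx.extend(accepted)
    return ctx[len(prompt):][:n_new], verify_passes


tokens, passes = speculative_decode([1, 2], n_new=10)
print(tokens == standard_decode([1, 2], n_new=10), passes)
```

Every committed token is one the trunk would have produced for the same prefix, so the sequence matches `standard_decode` while using fewer trunk passes. With sampling enabled, the same property needs the sampler to drive both sides as described above, which this greedy sketch does not model.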

Speculative decoding is off by default. Enable it on the server with --mtp-spec (or the environment variable TS_MTP_SPEC=1):

# Qwen 3.6 — the NextN block is embedded in the trunk GGUF, no extra file needed
./TensorSharp.Server --model Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf --backend ggml_cuda \
    --mtp-spec --mtp-draft 8 --mtp-pmin 0.75

# Gemma 4 — load the separate gemma4-assistant draft GGUF that matches the target
./TensorSharp.Server --model gemma-4-12B-it-Q4_K_M.gguf --backend ggml_cuda \
    --mtp-spec --mtp-draft-model gemma-4-12B-assistant-Q8_0.gguf

Two draft-head shapes

Where it's profitable

| Backend | Qwen 3.6 | Gemma 4 |
| --- | --- | --- |
| GGML CUDA / GGML Metal | ✅ fused verify + draft kernels | ✅ fused verify + draft kernels |
| Direct CUDA (cuda, pure C#) | ✅ GPU-resident per-op verify/draft | ✅ GPU-resident per-op verify/draft |
| CPU / GGML CPU / MLX | standard decode | standard decode |

Tuning: --mtp-draft (default 8) bounds the number of tokens drafted per step; --mtp-pmin (default 0.75) is the minimum draft-head confidence required to keep a drafted token. The Gemma 4 A/B switches are the TS_GMTP_* env vars.
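One way to picture how the two knobs interact is the Python sketch below. It assumes that drafting stops at the first token whose confidence falls below the threshold, which is a plausible reading of the description above but not confirmed by it. The function `select_draft` and its `(token, confidence)` input shape are hypothetical.

```python
def select_draft(candidates, max_draft=8, p_min=0.75):
    """Keep drafted tokens until the budget is spent or confidence drops below p_min.

    `candidates` is a list of (token, confidence) pairs in draft order.
    Assumption: the draft is cut at the first low-confidence token, since
    every later draft token was conditioned on it.
    """
    kept = []
    for token, confidence in candidates[:max_draft]:
        if confidence < p_min:
            break
        kept.append(token)
    return kept


proposals = [("the", 0.93), ("quick", 0.81), ("brown", 0.52), ("fox", 0.99)]
print(select_draft(proposals))               # budget 8, threshold 0.75
print(select_draft(proposals, max_draft=1))  # budget cuts first
print(select_draft(proposals, p_min=0.5))    # lower threshold keeps more
```

Under this reading, a higher threshold sends fewer tokens to verification and wastes less work on likely misses, while a lower one drafts further ahead at the risk of more rejected tokens.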

DiffusionGemma text diffusion

DiffusionGemma is fundamentally different from autoregressive models: it does not call Forward() one token at a time. Instead, it runs block-wise EntropyBound denoising over fixed-length canvases on a Gemma-4-derived MoE backbone, so each canvas is refined iteratively rather than written left to right.
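The following Python toy illustrates iterative refinement of one fixed-length canvas. The text above does not spell out the EntropyBound rule, so the sketch assumes it means "commit every masked position whose predictive entropy is at or below a bound, and otherwise commit the single most certain one". The denoiser is a hypothetical stand-in that is handed the answer, and all names (`toy_denoiser`, `denoise_block`, `bound`) are invented for illustration.

```python
import math

MASK = "<mask>"


def entropy(dist):
    """Shannon entropy (in nats) of a token -> probability mapping."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def toy_denoiser(canvas, target):
    """Hypothetical model stand-in: one distribution per still-masked slot.

    Confidence rises as more of the canvas is filled in, which mimics a
    model that gets more certain with more context.
    """
    n_filled = sum(tok != MASK for tok in canvas)
    dists = {}
    for i, tok in enumerate(canvas):
        if tok == MASK:
            p = min(0.99, 0.55 + 0.15 * n_filled + (0.10 if i % 2 == 0 else 0.0))
            dists[i] = {target[i]: p, "<other>": 1.0 - p}
    return dists


def denoise_block(target, bound=0.45):
    """Refine one canvas until no masked position remains."""
    canvas = [MASK] * len(target)
    steps = 0
    while MASK in canvas:
        dists = toy_denoiser(canvas, target)
        # Assumed rule: commit all positions whose entropy is within the bound.
        confident = [i for i, d in dists.items() if entropy(d) <= bound]
        if not confident:
            # Guarantee progress by committing the single most certain position.
            confident = [min(dists, key=lambda i: entropy(dists[i]))]
        for i in confident:
            canvas[i] = max(dists[i], key=dists[i].get)
        steps += 1
    return canvas, steps


canvas, steps = denoise_block(["fast", "paged", "batched", "decode"])
print(canvas, steps)
```

The point of the sketch is the control flow: positions are filled in order of certainty, several can be committed in one step, and the number of model calls is bounded by the canvas length rather than equal to it.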

Performance optimizations

A cross-architecture summary; each per-model card in docs/models/ walks through the same kernels with the exact GGML graph dispatched.

Memory optimizations