Advanced Features
The systems that make TensorSharp fast and scalable: continuous batching with a paged KV cache, speculative decoding, text diffusion, and the kernel- and memory-level optimizations underneath.
Continuous batching & paged KV cache
The server's InferenceEngine is a vLLM-style continuous-batching engine, on by default. Instead of running one request at a time, it interleaves many at the granularity of a single decode step.
- Paged KV pool — KV cache is partitioned into fixed-size blocks drawn from a shared pool, so memory is allocated per block rather than per worst-case sequence length.
- Block-hash prefix sharing — each full block is content-hashed; identical prefixes (system prompts, shared context) are shared across concurrent and sequential requests instead of recomputed.
- Iteration-level scheduler — admits and preempts sequences mid-batch and packs them into one forward pass on models that implement IBatchedPagedModel.
- Optional SSD tier — cold blocks can spill to an SSD tier for very large KV working sets.
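The paged pool and block-hash prefix sharing can be illustrated with a small Python sketch. It is not TensorSharp's implementation: the names PagedKvPool and block_hashes, the 4-token block size, and the use of SHA-256 are all assumptions made for the example. It shows only the idea that each full block is hashed together with everything before it, so two prompts with the same prefix map to the same blocks.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, not TensorSharp's)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash every full block so each hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        chunk = ",".join(map(str, tokens[start:start + block_size])).encode()
        parent = hashlib.sha256(parent + b"|" + chunk).digest()
        hashes.append(parent)
    return hashes


@dataclass
class PagedKvPool:
    """Hypothetical pool: hands out block ids and shares blocks with equal prefix hashes."""
    num_blocks: int
    free: list = field(default_factory=list)
    by_hash: dict = field(default_factory=dict)   # prefix hash -> block id
    refcount: dict = field(default_factory=dict)  # block id -> number of users

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self, tokens):
        """Return (block_table, reused_block_count) for one prompt."""
        table, reused = [], 0
        for h in block_hashes(tokens):
            block = self.by_hash.get(h)
            if block is None:
                block = self.free.pop()  # IndexError here means the pool is exhausted
                self.by_hash[h] = block
                self.refcount[block] = 0
            else:
                reused += 1
            self.refcount[block] += 1
            table.append(block)
        if len(tokens) % BLOCK_SIZE:
            # The trailing partial block is not hashed, so it is always private.
            block = self.free.pop()
            self.refcount[block] = 1
            table.append(block)
        return table, reused


pool = PagedKvPool(num_blocks=16)
system_prompt = list(range(8))  # two full blocks shared by both requests
table_a, reused_a = pool.allocate(system_prompt + [100, 101, 102])
table_b, reused_b = pool.allocate(system_prompt + [200, 201, 202, 203, 204])
```

The second request reuses the two blocks that hold the shared system prompt and allocates new blocks only for its own suffix, which is why memory scales with distinct content rather than with worst-case sequence length.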
Models that have not implemented the batched path still run on the engine's isolated per-sequence KV-swap fallback. Tune the engine with the TS_SCHED_* variables, or disable continuous batching entirely with --no-continuous-batching.
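As a rough illustration of what scheduling "at the granularity of a single decode step" means, the toy loop below admits waiting requests whenever a slot frees up instead of waiting for the whole batch to finish. The function name run_continuous_batching and the max_batch parameter are hypothetical, and the sketch leaves out preemption and KV management entirely.

```python
from collections import deque


def run_continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate), each count >= 1.

    Returns the list of sequence names that ran in each decode step.
    """
    waiting = deque(requests)
    running = {}  # name -> tokens still to generate
    steps = []
    while waiting or running:
        # Admit new sequences as soon as a slot is free.
        while waiting and len(running) < max_batch:
            name, remaining = waiting.popleft()
            running[name] = remaining
        steps.append(sorted(running))
        # One decode step produces one token for every running sequence.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
    return steps


schedule = run_continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch=2)
```

Here request c joins the batch in the second step, right after a finishes, so the six tokens are produced in three decode steps rather than six sequential ones.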
Native paged attention
The native kernel TSGgml_PagedAttentionForward (and a WithSinks variant for GPT OSS) gathers K/V from the paged buffer in C++, builds a small GGML graph per sequence, and dispatches ggml_flash_attn_ext — the same fused Metal/CUDA flash-attention kernel the single-sequence path uses. On a long-context Ministral-3-14B workload (4×~800 tokens) it runs ~21% faster than the legacy per-sequence GGML path.
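The gather step can be pictured with a plain-Python sketch. This is only an illustration of the data movement, not the native kernel: the real path runs in C++ and dispatches ggml_flash_attn_ext, whereas the helper names, the 2-token block size, and the single-head attention below are assumptions chosen to keep the example short.

```python
import math

BLOCK_SIZE = 2  # tokens per block (illustrative)


def gather(paged, block_table, seq_len):
    """Collect the first seq_len rows of one sequence from its non-contiguous blocks."""
    rows = []
    for pos in range(seq_len):
        block = block_table[pos // BLOCK_SIZE]
        rows.append(paged[block][pos % BLOCK_SIZE])
    return rows


def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def paged_attention(query, k_pages, v_pages, block_table, seq_len):
    """Gather K/V through the block table, then run ordinary attention on the result."""
    keys = gather(k_pages, block_table, seq_len)
    values = gather(v_pages, block_table, seq_len)
    return attention(query, keys, values)


# Three tokens of K/V, stored in blocks 3 and 0 of a four-block pool.
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
k_pages = [[k[2], [0.0, 0.0]], None, None, [k[0], k[1]]]
v_pages = [[v[2], [0.0, 0.0]], None, None, [v[0], v[1]]]
paged_out = paged_attention([0.5, 0.25], k_pages, v_pages, block_table=[3, 0], seq_len=3)
contiguous_out = attention([0.5, 0.25], k, v)
```

Because the gather restores the logical token order before attention runs, the paged result matches attention over a contiguous cache.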
Batched forward passes (Mistral 3, Gemma 4, GPT OSS, Qwen 3.5/3.6 with a GatedDeltaNet recurrent-state pool, and Nemotron-H with a Mamba2 recurrent-state pool) pack N sequences into one ForwardBatch call with one batched linear-projection matmul per layer and paged K/V scatter. Gemma 4 reaches ~1.5–1.6× legacy throughput; Nemotron-H Mamba2 batched reaches ~3.95× at batch=3 on an Apple M4 Pro.
Full deep dive: docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md in the repository.
MTP / NextN speculative decoding
Some architectures ship a multi-token-prediction (MTP / NextN) draft head that lets the server run lossless speculative decoding for solo (non-concurrent) sequences. The draft proposes several future tokens cheaply, the trunk verifies all of them in one batched forward, and accepted tokens are committed in a single step.
Because the request's own sampler — temperature, top-k/p, and all penalties — drives both the draft and the verify, the output is identical to standard decode. Speculation only changes how many forward passes it takes to produce the same tokens.
It is off by default. Enable it on the server with --mtp-spec (env TS_MTP_SPEC=1):
# Qwen 3.6 — the NextN block is embedded in the trunk GGUF, no extra file needed
./TensorSharp.Server --model Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf --backend ggml_cuda \
--mtp-spec --mtp-draft 8 --mtp-pmin 0.75
# Gemma 4 — load the separate gemma4-assistant draft GGUF that matches the target
./TensorSharp.Server --model gemma-4-12B-it-Q4_K_M.gguf --backend ggml_cuda \
--mtp-spec --mtp-draft-model gemma-4-12B-assistant-Q8_0.gguf
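The draft-then-verify loop described above can be sketched with toy models. Everything here is a stand-in: target_next plays the role of "trunk logits plus the request's sampler" and is deterministic for a given context, draft_next is a deliberately imperfect draft head, and the verify pass is modeled as a loop that counts as one trunk pass. The sketch omits details such as recurrent-state rollback and any bonus token after a fully accepted draft.

```python
def target_next(context):
    """Stand-in for trunk + sampler: deterministic for a given context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in draft head: agrees with the trunk except at every third position."""
    token = target_next(context)
    return token if len(context) % 3 else (token + 1) % 50


def standard_decode(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):], n_tokens  # (tokens, trunk passes)


def speculative_decode(prompt, n_tokens, max_draft=4):
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft: propose up to max_draft tokens cheaply.
        draft, ctx = [], list(out)
        for _ in range(max_draft):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. Verify: one batched trunk pass checks every drafted position.
        passes += 1
        ctx = list(out)
        for token in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # the committed token is always the trunk's
            if token != expected:
                break  # first mismatch: the rest of the draft is discarded
        out = ctx
    # The last verify may overshoot, so trim to the requested length.
    return out[len(prompt):][:n_tokens], passes


plain_tokens, plain_passes = standard_decode([1, 2, 3], 12)
spec_tokens, spec_passes = speculative_decode([1, 2, 3], 12)
```

Every committed token comes from the trunk, so the two token lists are equal by construction; only the number of trunk passes differs.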
Two draft-head shapes
- Qwen 3.6 (embedded NextN) — the GGUF carries one extra decoder block plus the NextN projection/norm tensors. No separate file; --mtp-draft-model is ignored. The GatedDeltaNet trunk state is snapshotted so a partially-rejected verify batch can roll back.
- Gemma 4 (separate gemma4-assistant GGUF) — an EAGLE-style recurrent drafter loaded with --mtp-draft-model. It holds no K/V of its own: every draft layer queries the target model's existing per-layer KV cache. The draft's hidden size must match the target (pair the 12B target with its 12B draft). A mismatched or incomplete draft fails fast at startup with a remediation hint.
Where it's profitable
| Backend | Qwen 3.6 | Gemma 4 |
|---|---|---|
| GGML CUDA / GGML Metal | ✅ fused verify + draft kernels | ✅ fused verify + draft kernels |
| Direct CUDA (cuda, pure C#) | ✅ GPU-resident per-op verify/draft | ✅ GPU-resident per-op verify/draft |
| CPU / GGML CPU / MLX | standard decode | standard decode |
Tuning: --mtp-draft (default 8) bounds tokens drafted per step; --mtp-pmin (default 0.75) is the minimum draft-head confidence to keep a token. Gemma 4 A/B switches are the TS_GMTP_* env vars.
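One way to picture how the two knobs interact is the sketch below. It assumes that drafting stops at the first token whose confidence falls under the threshold; the source only says that --mtp-pmin is the minimum confidence to keep a token, so the exact cut-off rule, the draft_tokens helper, and the propose callback are hypothetical.

```python
def draft_tokens(propose, context, max_draft=8, pmin=0.75):
    """propose(context) -> (token, confidence).

    Drafts at most max_draft tokens and stops early at the first token
    whose confidence is below pmin (assumed cut-off rule).
    """
    drafted, ctx = [], list(context)
    for _ in range(max_draft):
        token, confidence = propose(ctx)
        if confidence < pmin:
            break
        drafted.append(token)
        ctx.append(token)
    return drafted


# Toy draft head: confidence drops sharply at the fourth drafted position.
CONFIDENCES = [0.99, 0.9, 0.8, 0.6, 0.95, 0.95, 0.95, 0.95]


def toy_propose(ctx):
    position = len(ctx) - 2  # the toy prompt below has two tokens
    return len(ctx), CONFIDENCES[position]


default_draft = draft_tokens(toy_propose, [0, 1])
short_draft = draft_tokens(toy_propose, [0, 1], max_draft=2)
permissive_draft = draft_tokens(toy_propose, [0, 1], pmin=0.5)
```

A higher pmin shortens drafts but wastes fewer verify slots on tokens that are likely to be rejected; a larger draft length helps only when the draft head stays confident for long runs.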
DiffusionGemma text diffusion
DiffusionGemma is fundamentally different from autoregressive models: it does not call Forward() one token at a time. Instead it uses block-wise EntropyBound denoising over fixed-length canvases on a Gemma-4-derived MoE backbone — the whole answer is refined iteratively rather than written left to right.
- CLI — --diffusion-steps (denoising steps per block), --diffusion-seed, and --diffusion-blocks.
- Web UI — streams whole-message replace events so you watch the answer denoise live, and batches concurrent diffusion requests at block boundaries via DiffusionBatchScheduler.
- Optimizations — on GPU backends the prompt side of [prompt | canvas] is prefetched once per block and reused across steps; GGML backends use a fused whole-model diffusion decode plus a fused lm-head tail.
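To make the contrast with left-to-right decoding concrete, here is a toy denoising loop for a single block. It is a loose sketch, not DiffusionGemma's algorithm: it assumes that an entropy bound means "commit a canvas position once the predicted distribution's entropy is at or below a threshold", and the denoise_block helper, the predict callback, and the toy model are all hypothetical.

```python
import math

MASK = None  # placeholder for a canvas position that has not been decided yet


def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def denoise_block(predict, prompt, canvas_len, steps, bound):
    """predict(prompt, canvas, pos) -> {token: probability}.

    Returns (final canvas, one snapshot of the canvas per step).
    """
    canvas = [MASK] * canvas_len
    snapshots = []
    for step in range(steps):
        last = step == steps - 1
        # All open positions are predicted from the same canvas state.
        dists = {pos: predict(prompt, canvas, pos)
                 for pos in range(canvas_len) if canvas[pos] is MASK}
        for pos, dist in dists.items():
            # Commit confident positions now; the final step commits the rest.
            if last or entropy(dist) <= bound:
                canvas[pos] = max(dist, key=dist.get)
        snapshots.append(list(canvas))
        if MASK not in canvas:
            break
    return canvas, snapshots


TARGET = ["the", "cat", "sat", "down"]


def toy_predict(prompt, canvas, pos):
    """Toy model: confident only when the position to the left is already filled."""
    sure = pos == 0 or canvas[pos - 1] is not MASK
    p = 0.97 if sure else 0.5
    return {TARGET[pos]: p, "?": 1.0 - p}


final, frames = denoise_block(toy_predict, ["hi"], canvas_len=4, steps=8, bound=0.3)
```

Each snapshot is a full copy of the canvas, which is the kind of state a UI could send as a whole-message replace event.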
Performance optimizations
A cross-architecture summary; each per-model card in docs/models/ walks through the same kernels with the exact GGML graph dispatched.
- Fused GPU decode (Gemma 4) — all transformer layers in a single GGML graph dispatch, cutting CPU↔GPU round-trips from hundreds per token to one (~2.6× over per-op dispatch).
- Fused GPU prefill (Gemma 4) — dense layers run the whole block (norms, QKV, RoPE, attention, FFN, residuals) as one dispatch per layer during prefill.
- Chunked prefill (Gemma 4) — long prompts are split into bounded chunks to avoid O(n²) attention score tensors for sliding-window layers.
- Fused Qwen 3.5/3.6 attention & FFN — single-graph fused attention-layer decode, fused prefill attention, fused out-proj + FFN, and fused vision encoder blocks (~15 ops → 2).
- Native quantized compute — Q4_K_M, Q6_K, Q8_0, IQ2_XXS, MXFP4 used directly in matmul without expanding to FP32; a batched AddmmQuantBatch handles multiple sub-weight matmuls in one dispatch.
- Batched GPU MoE — all selected experts (plus the optional shared expert and residual add) collapse into a single GGML graph dispatch per MoE layer.
- KV-cache prefix reuse — multi-turn conversations reuse the longest matching token prefix; sliding-window models back off truncation by the window size.
- Kernel warmup — both CLI and server run a tiny forward pass at startup to pre-compile GPU kernels and warm the pool, avoiding cold-start latency.
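The KV-cache prefix reuse item in the list above boils down to a longest-common-prefix check between the cached token sequence and the new prompt. The sketch below shows that check; the reusable_prefix name is hypothetical, and the sliding-window branch reflects just one possible reading of "back off truncation by the window size", namely keeping fewer tokens whenever the cache has to be truncated.

```python
def reusable_prefix(cached_tokens, new_tokens, sliding_window=0):
    """Number of cached tokens whose KV entries can be kept for the new prompt."""
    match = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        match += 1
    if match < len(cached_tokens) and sliding_window:
        # Assumed behavior: when the cache must be truncated, a sliding-window
        # model keeps window-size fewer tokens and recomputes them.
        match = max(0, match - sliding_window)
    return match


cached = [1, 2, 3, 4, 5, 6]
extended = reusable_prefix(cached, cached + [7, 8])          # next turn appends tokens
edited = reusable_prefix(cached, [1, 2, 3, 4, 9, 9])         # history diverges at index 4
edited_swa = reusable_prefix(cached, [1, 2, 3, 4, 9, 9], sliding_window=3)
```

When a conversation simply grows, the whole cache is reused and only the new tokens are prefilled; when earlier history changes, everything after the divergence point is recomputed.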
Memory optimizations
- Zero-copy file-mapped weights — the GGUF is memory-mapped and quantized tensors bind directly into native ops, removing a per-tensor copy that roughly doubled the resident set. Example: Qwen3.5-35B-A3B-IQ2_XXS (~10 GB GGUF) runs at ~7 GB peak under Metal instead of ~17 GB.
- Best-fit memory pool with bounded retention (blocks capped at 64 MB, pool at 32 blocks) keeps the working set tight across long runs.
- Paged KV block pool with optional SSD spillover — RAM-capped, LRU-evicted, with content-hash prefix reuse across sessions.
- KV block codecs — optional in-place compression with TurboQuantKvCodec (Q4 / Q8) via --paged-kv-quant-bits, trading a small accuracy cost for half/quarter the per-block footprint.
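The best-fit pool with bounded retention mentioned in the list above can be sketched as follows. The class tracks block sizes only and is an illustration under assumptions: BestFitPool, rent, and give_back are hypothetical names, and the two limits are read as "never retain a block above 64 MB" and "never retain more than 32 free blocks".

```python
MAX_BLOCK_BYTES = 64 * 1024 * 1024  # larger blocks are never retained
MAX_POOLED_BLOCKS = 32              # at most this many free blocks are kept


class BestFitPool:
    """Toy pool that tracks sizes of retained free blocks (no real memory)."""

    def __init__(self):
        self.free = []

    def rent(self, size):
        """Return (block_size, from_pool): the smallest retained block that fits,
        or a freshly sized block when nothing in the pool is large enough."""
        fits = [s for s in self.free if s >= size]
        if fits:
            best = min(fits)
            self.free.remove(best)
            return best, True
        return size, False

    def give_back(self, block_size):
        """Retain the block only if it is small enough and the pool has room."""
        if block_size <= MAX_BLOCK_BYTES and len(self.free) < MAX_POOLED_BLOCKS:
            self.free.append(block_size)
            return True
        return False  # the caller releases the block immediately


mem_pool = BestFitPool()
for size in (1024, 4096, 2048):
    mem_pool.give_back(size)
best_fit = mem_pool.rent(1500)     # picks 2048, the tightest block that fits
fresh = mem_pool.rent(8192)        # nothing fits, so a new block is created
oversized_kept = mem_pool.give_back(65 * 1024 * 1024)
```

Choosing the tightest fit limits wasted space per allocation, and the two caps keep the pool from pinning large or numerous idle buffers during long runs.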