Benchmarks & Testing

How TensorSharp measures performance against itself over time and against other engines, plus the test harness that guards correctness.

Internal regression baseline

Reference numbers were measured with Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf (~10 GB on disk; 256 routed experts, 8 active per token; 12 full-attention + 30 GatedDeltaNet recurrent layers) on an Apple M4 Pro with 24 GB unified memory:

Metric	Before (v1 baseline)	After	Change
Process peak memory footprint	~17 GB	~8 GB	−52%
Server resident set after load	~20 GB	~8 GB	−60%
Decode throughput (256 prefill / 64 decode)	~3.8 tok/s	~10.8 tok/s	2.85×
Decode latency	~264 ms/token	~92 ms/token	−65%
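
The Change column can be rechecked from the rounded Before/After values. The sketch below is only an illustration of that arithmetic; the helper names are made up for this example and are not part of TensorSharp. It also shows that throughput and per-token latency are reciprocals, so the two decode rows describe the same measurement from two angles.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """Ratio of after to before (for throughput, higher is better)."""
    return after / before


def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token decode latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Rounded values copied from the table above.
print(f"peak memory change:  {percent_change(17, 8):.1f}%")
print(f"resident set change: {percent_change(20, 8):.1f}%")
print(f"latency change:      {percent_change(264, 92):.1f}%")
print(f"throughput ratio:    {speedup(3.8, 10.8):.2f}x")

# Latency and throughput are two views of the same decode measurement.
print(f"264 ms/token = {tokens_per_second(264):.1f} tok/s")
print(f" 92 ms/token = {tokens_per_second(92):.1f} tok/s")
```

Because the table's inputs are rounded, results computed this way can differ from the Change column by about one percentage point.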

Reproduce with the built-in benchmark:

./TensorSharp.Cli --model Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf --backend ggml_metal \
    --benchmark --bench-prefill 256 --bench-decode 64 --bench-runs 3

The memory reduction comes primarily from no longer copying the GGUF file into a separate native heap buffer: the file is now memory-mapped (mmap) and bound zero-copy into Metal command buffers. The throughput gain is largely a side effect of removing that ~10 GB duplicate working set, which previously triggered OS-level memory pressure on machines with ≤24 GB RAM.
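
The difference between copying a file onto the heap and mapping it can be sketched in a few lines of Python. This is a generic illustration of the two loading strategies, not TensorSharp's loader (which is native code and additionally binds the mapping to Metal); the function names and the tiny stand-in file are assumptions made for the example.

```python
import mmap
import os
import tempfile


def load_by_copy(path: str) -> bytes:
    """Read the whole file into a heap-allocated bytes object (a full copy)."""
    with open(path, "rb") as f:
        return f.read()


def open_mapped(path: str) -> mmap.mmap:
    """Map the file read-only; the OS pages data in on access, with no second copy."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# A tiny stand-in for a model file: the 4-byte GGUF magic plus padding.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + bytes(1024))
    path = tmp.name

mapped = open_mapped(path)
view = memoryview(mapped)      # wraps the mapping without copying
header = view[:4]              # slicing a memoryview does not copy either
copied = load_by_copy(path)    # duplicates every byte of the file on the heap

print(bytes(header))           # b'GGUF'
print(len(copied) == len(mapped))

# Release views before closing the mapping, then clean up the temp file.
header.release()
view.release()
mapped.close()
os.remove(path)
```

With a ~10 GB model, the copying path holds the file contents twice (page cache plus heap), which is the duplicate working set described above.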

Cross-engine inference matrix

For an apples-to-apples comparison of TensorSharp, llama.cpp, and Ollama on the same on-disk GGUF files, see docs/inference_benchmark_matrix.md in the repository. The matrix currently covers Gemma 4 E4B Q8_0 with text, synthetic-prefill, image, audio, and video tasks, plus KV-cache dtype sweeps over f32, f16, and q8_0. The driver scripts live in benchmarks/inference_matrix/scripts/; per-cell raw JSON outputs are written under benchmarks/inference_matrix/results/ when you run the matrix.
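
To get a feel for the size of such a matrix, the engines, task types, and KV-cache dtypes named above can be expanded into cells. This is a rough sketch that assumes a full cross product; the real driver scripts may skip combinations an engine does not support, and the dictionary keys are hypothetical rather than the schema of the JSON results.

```python
from itertools import product

ENGINES = ("TensorSharp", "llama.cpp", "Ollama")
TASKS = ("text", "synthetic-prefill", "image", "audio", "video")
KV_CACHE_DTYPES = ("f32", "f16", "q8_0")


def matrix_cells(engines=ENGINES, tasks=TASKS, kv_dtypes=KV_CACHE_DTYPES):
    """Return every (engine, task, kv_cache_dtype) combination as a dict."""
    return [
        {"engine": engine, "task": task, "kv_cache_dtype": dtype}
        for engine, task, dtype in product(engines, tasks, kv_dtypes)
    ]


cells = matrix_cells()
print(len(cells))  # 45
print(cells[0])    # {'engine': 'TensorSharp', 'task': 'text', 'kv_cache_dtype': 'f32'}
```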

Testing

Unit tests (xUnit)

InferenceWeb.Tests exercises in-process behavior that doesn't require a running server: managed quantized ops, direct CUDA and MLX backend kernels (when the hardware is available), paged-KV scheduling, batched-executor correctness, per-model batched-forward correctness against the legacy path, MTP / NextN speculative-decoding correctness, DiffusionGemma probes, codec round-trips, prompt rendering, and the server CLI options builder.

dotnet test InferenceWeb.Tests/InferenceWeb.Tests.csproj

Server integration tests

Integration tests in TensorSharp.Server/testdata/ cover all three API styles (Web UI SSE, Ollama, OpenAI), multi-turn conversations, thinking mode, tool calling, structured outputs, queue-status compatibility, concurrent requests, and abort support. Architecture-specific features are auto-detected and skipped when the active model doesn't support them.

# Start TensorSharp.Server, then run:
python3 TensorSharp.Server/testdata/test_multiturn.py
# or
bash TensorSharp.Server/testdata/test_multiturn.sh

Inference matrix runner

TensorSharp.TestMatrix is the broader CLI-driven harness for long-running model × backend × feature × env-var coverage. It discovers GGUF files, filters unavailable backends and unsupported prompt types, runs baseline plus env-var sweep cells, writes one JSON result per cell, emits an aggregate Markdown report, and compares against per-host baselines.

dotnet build TensorSharp.TestMatrix/TensorSharp.TestMatrix.csproj -c Release
dotnet run --project TensorSharp.TestMatrix -c Release -- --dry-run
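
The baseline comparison mentioned above can be pictured as a per-cell tolerance check. The following is a minimal sketch under stated assumptions, not the TestMatrix implementation: the function name, the cell ids, the 10% tolerance, and the throughput values are all invented for illustration.

```python
import json


def find_regressions(results, baselines, tolerance=0.10):
    """Return cells whose throughput fell more than `tolerance` below baseline.

    results / baselines: dicts mapping a cell id to decode tokens per second.
    Cells without a usable baseline are skipped rather than flagged.
    """
    regressions = []
    for cell_id, tok_s in sorted(results.items()):
        base = baselines.get(cell_id)
        if base is None or base <= 0:
            continue
        drop = (base - tok_s) / base
        if drop > tolerance:
            regressions.append(
                {"cell": cell_id, "baseline": base, "current": tok_s, "drop": round(drop, 3)}
            )
    return regressions


# Hypothetical per-host baselines and a fresh run (made-up numbers).
baselines = {"cell-a": 10.8, "cell-b": 25.0}
results = {"cell-a": 9.0, "cell-b": 24.5, "cell-c": 3.0}

# Only cell-a is reported: cell-b is within tolerance, cell-c has no baseline.
print(json.dumps(find_regressions(results, baselines), indent=2))
```

Keeping baselines per host matters because, as the table at the top of this page shows, absolute numbers are tied to one machine's memory and hardware.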

Benchmark numbers depend heavily on hardware, model, quantization, and KV-cache dtype. Treat the figures above as a reproducible reference point on one machine, not a universal guarantee — run the built-in benchmark on your own hardware for the numbers that matter to you.
