Benchmarks & Testing
How TensorSharp measures performance against itself over time and against other engines, plus the test harness that guards correctness.
Internal regression baseline
Reference numbers measured on Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf (~10 GB on disk; 256 routed experts, 8 active per token; 12 full-attention + 30 GatedDeltaNet recurrent layers) on an Apple M4 Pro with 24 GB unified memory:
| Metric | Before (v1 baseline) | After | Change |
|---|---|---|---|
| Process peak memory footprint | ~17 GB | ~8 GB | −52% |
| Server resident set after load | ~20 GB | ~8 GB | −60% |
| Decode throughput (256 prefill / 64 decode) | ~3.8 tok/s | ~10.8 tok/s | 2.85× |
| Decode latency | ~264 ms/token | ~92 ms/token | −65% |
Reproduce with the built-in benchmark:
./TensorSharp.Cli --model Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf --backend ggml_metal \
--benchmark --bench-prefill 256 --bench-decode 64 --bench-runs 3
The memory reduction comes primarily from no longer copying the GGUF into a separate native heap buffer (it is now mmap-bound zero-copy into Metal command buffers). The throughput gain is largely a side effect of removing that ~10 GB duplicate working set, which previously triggered OS-level memory pressure on machines with ≤24 GB RAM.
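As a sanity check on the Change column, the arithmetic can be rederived from the rounded figures in the table. The sketch below is illustrative only: the helper names are made up for this example and are not part of TensorSharp. It relies on per-token decode latency being the reciprocal of decode throughput.

```python
def percent_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def ms_per_token(tokens_per_second: float) -> float:
    """Per-token decode latency implied by a throughput figure."""
    return 1000.0 / tokens_per_second


# Rounded figures copied from the table above.
before_tps, after_tps = 3.8, 10.8

speedup = after_tps / before_tps
latency_before = ms_per_token(before_tps)
latency_after = ms_per_token(after_tps)

print(f"speedup: {speedup:.2f}x")
print(f"latency: {latency_before:.0f} ms -> {latency_after:.0f} ms "
      f"({percent_change(latency_before, latency_after):+.0f}%)")
print(f"peak memory: {percent_change(17, 8):+.0f}%")
print(f"resident set: {percent_change(20, 8):+.0f}%")
```

Because the table's inputs are themselves rounded, rederived values can differ from the published ones by a unit in the last digit.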
Cross-engine inference matrix
For an apples-to-apples comparison of TensorSharp vs llama.cpp vs Ollama on the same on-disk GGUF files (Gemma 4 E4B Q8_0 today, with text / synthetic-prefill / image / audio / video tasks and KV-cache dtype sweeps for f32, f16, and q8_0), see docs/inference_benchmark_matrix.md in the repository. The driver scripts live in benchmarks/inference_matrix/scripts/; per-cell raw JSON outputs are written under benchmarks/inference_matrix/results/ when you run the matrix.
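Once the matrix has written its per-cell JSON files, a small script can roll them up for a quick comparison. The following is a minimal sketch, not the repository's own tooling. The JSON schema is not described here, so the field names `engine` and `decode_tok_s` are hypothetical placeholders; substitute whatever keys the real result files use.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median


def summarize(results_dir: str, group_key: str = "engine",
              metric_key: str = "decode_tok_s") -> dict[str, float]:
    """Median of `metric_key` per `group_key` across all JSON files found.

    Both key names are hypothetical defaults, not a documented schema.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for path in sorted(Path(results_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            cell = json.load(fh)
        # Skip files that do not carry both (hypothetical) fields.
        if isinstance(cell, dict) and group_key in cell and metric_key in cell:
            groups[str(cell[group_key])].append(float(cell[metric_key]))
    return {name: median(values) for name, values in groups.items()}
```

Calling `summarize("benchmarks/inference_matrix/results")` after a matrix run would return one median per group, provided the files actually contain those fields. The median is used rather than the mean so that a single slow run does not skew the summary.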
Testing
Unit tests (xUnit)
InferenceWeb.Tests exercises in-process behavior that doesn't require a running server: managed quantized ops, direct CUDA and MLX backend kernels (when the hardware is available), paged-KV scheduling, batched-executor correctness, per-model batched-forward correctness against the legacy path, MTP / NextN speculative-decoding correctness, DiffusionGemma probes, codec round-trips, prompt rendering, and the server CLI options builder.
dotnet test InferenceWeb.Tests/InferenceWeb.Tests.csproj
Server integration tests
Integration tests in TensorSharp.Server/testdata/ cover all three API styles (Web UI SSE, Ollama, OpenAI), multi-turn conversations, thinking mode, tool calling, structured outputs, queue-status compatibility, concurrent requests, and abort support. Architecture-specific features are auto-detected and skipped when the active model doesn't support them.
# Start TensorSharp.Server, then run:
python3 TensorSharp.Server/testdata/test_multiturn.py
# or
bash TensorSharp.Server/testdata/test_multiturn.sh
Inference matrix runner
TensorSharp.TestMatrix is the broader CLI-driven harness for long-running model × backend × feature × env-var coverage. It discovers GGUF files, filters unavailable backends and unsupported prompt types, runs baseline plus env-var sweep cells, writes one JSON result per cell, emits an aggregate Markdown report, and compares against per-host baselines.
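The baseline-comparison step can be pictured as a tolerance check per cell. This is a simplified sketch of the idea, not TensorSharp.TestMatrix's actual logic: the function name, the 5% default tolerance, and the cell identifiers are all assumptions made for illustration.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.05) -> dict[str, float]:
    """Return cells whose throughput dropped by more than `tolerance`.

    Values are tok/s keyed by a cell identifier; the result maps each
    regressed cell to its fractional drop (0.10 means 10% slower).
    """
    regressions: dict[str, float] = {}
    for cell, base in baseline.items():
        now = current.get(cell)
        if now is None or base <= 0:
            continue  # cell not run this time, or unusable baseline
        drop = (base - now) / base
        if drop > tolerance:
            regressions[cell] = drop
    return regressions


# Hypothetical cell names and numbers, for illustration only.
baseline = {"model-a/backend-x/text": 10.0, "model-a/backend-x/image": 8.0}
current = {"model-a/backend-x/text": 9.0, "model-a/backend-x/image": 7.9}

for cell, drop in find_regressions(baseline, current).items():
    print(f"REGRESSION {cell}: {drop:.1%} slower than baseline")
```

Keeping baselines per host matters because, as with any benchmark, absolute throughput on one machine says little about another; only same-host deltas are meaningful.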
dotnet build TensorSharp.TestMatrix/TensorSharp.TestMatrix.csproj -c Release
dotnet run --project TensorSharp.TestMatrix -c Release -- --dry-run
Benchmark numbers depend heavily on hardware, model, quantization, and KV-cache dtype. Treat the figures above as a reproducible reference point on one machine, not a universal guarantee. Run the built-in benchmark on your own hardware for the numbers that matter to you.