Glossary & FAQ

New to local LLMs? Start here. Plain-language definitions of the terms used across this wiki, followed by the questions people ask most.

Glossary

Term	What it means
LLM (large language model)	A neural network trained to predict text. You give it a prompt; it produces a continuation.
Inference	Running a trained model to get an answer (as opposed to training, which creates the model). TensorSharp is an inference engine.
GGUF	A single-file model format that packs the model's weights and metadata together. TensorSharp loads GGUF files, typically downloaded from Hugging Face.
Quantization	Compressing model weights to fewer bits (e.g. `Q4_K_M` ≈ 4-bit, `Q8_0` ≈ 8-bit) so a model fits in less memory and runs faster, with a small quality trade-off.
Token	The unit a model reads and writes — roughly a word piece. "Tokens per second" measures speed; "context length" is how many tokens fit at once.
Tokenizer	Converts text to tokens and back. TensorSharp ships SentencePiece and BPE tokenizers.
Prefill vs decode	Prefill processes your whole prompt at once; decode generates the reply one token at a time. They have different performance characteristics.
KV cache	Stored attention "keys and values" from earlier tokens, so the model doesn't recompute them. It makes long contexts and multi-turn chats efficient. → paged KV
TTFT	"Time to first token" — how long before the answer starts streaming. KV-cache reuse reduces it on follow-up turns.
Backend	The hardware path that runs the math: CPU, NVIDIA GPU (CUDA), or Apple Silicon GPU (Metal/MLX). → Backends
MoE (Mixture of Experts)	A model that routes each token to a few specialized sub-networks ("experts") instead of the whole network, giving large capacity at lower compute per token.
Multimodal	Able to take more than text — images, audio, or video — as input. → Multimodal
Thinking / reasoning	The model produces hidden step-by-step reasoning before its final answer; TensorSharp separates the two. → Thinking mode
Tool / function calling	The model can ask your application to run a function (e.g. "get_weather") and use the result. → Tool calling
Continuous batching	Serving many users at once by interleaving their requests on a single hosted model, instead of one at a time. → Deep dive
Speculative decoding	A small "draft" head guesses several tokens; the main model verifies them in one pass — faster output, identical result. → MTP
Projector (mmproj)	A companion file that lets a multimodal model understand images/audio. Place it next to the model or pass `--mmproj`.
Sampling	How the next token is chosen from the model's probabilities. Temperature, top-p, and top-k control creativity vs. determinism. → Sampling

Frequently asked questions

Do I need a GPU?

No. TensorSharp runs on plain CPU (--backend cpu or the faster ggml_cpu). A GPU — NVIDIA (CUDA) or Apple Silicon (Metal/MLX) — makes generation substantially faster, especially for larger models. See Backends.

Which model should I start with?

Gemma-4-E4B (Q8_0) is a small, well-tested starting point. For lower memory, choose a Q4_K_M quantization; for higher quality, a larger model or Q8_0. See Model downloads.

Is my data private?

Yes — inference happens entirely on your machine. Prompts, documents, and images are not sent to any external service. That is the main reason organizations choose local inference. See Business value.

Can I keep using my existing OpenAI / Ollama code?

Yes. The server speaks both wire formats. Point an OpenAI client at http://localhost:5000/v1 (any API key) or an Ollama client at http://localhost:5000/api/…. See the HTTP API.

How do I serve many users at once?

Run TensorSharp.Server. Its continuous-batching engine interleaves concurrent requests against one hosted model and shares common prompt prefixes across them, which keeps throughput high. See Continuous batching.

How can I make it faster?

Use a GPU backend (ggml_cuda or ggml_metal).
Pick a smaller or more aggressively quantized model.
On supported models, enable speculative decoding with --mtp-spec.
Reuse sessions so the KV-cache prefix carries over between turns.

What does it cost?

There are no per-token fees — you run open models on hardware you control. The cost is the hardware and the electricity to run it. See Business value.

What platforms are supported?

Windows, Linux, and macOS (Apple Silicon), on .NET 10. Pre-built self-contained binaries are published for Windows/Linux (CPU and CUDA) and macOS arm64. See Getting Started and platform binaries.

How is it licensed?

TensorSharp is authored by Zhongkai Fu and distributed under the license in the repository's LICENSE file (BSD-3-Clause). Each model you download has its own separate license from its publisher — check the model card on Hugging Face.

Where do I report issues or contribute?

On the GitHub repository.

← API Reference Back to Home →