Glossary & FAQ
New to local LLMs? Start here. Plain-language definitions of the terms used across this wiki, followed by the questions people ask most.
Glossary
| Term | What it means |
|---|---|
| LLM (large language model) | A neural network trained to predict text. You give it a prompt; it produces a continuation. |
| Inference | Running a trained model to get an answer (as opposed to training, which creates the model). TensorSharp is an inference engine. |
| GGUF | A single-file model format that packs the model's weights and metadata together. TensorSharp loads GGUF files, typically downloaded from Hugging Face. |
| Quantization | Compressing model weights to fewer bits (e.g. Q4_K_M ≈ 4-bit, Q8_0 ≈ 8-bit) so a model fits in less memory and runs faster, with a small quality trade-off. |
| Token | The unit a model reads and writes — roughly a word piece. "Tokens per second" measures speed; "context length" is how many tokens fit at once. |
| Tokenizer | Converts text to tokens and back. TensorSharp ships SentencePiece and BPE tokenizers. |
| Prefill vs decode | Prefill processes your whole prompt at once; decode generates the reply one token at a time. They have different performance characteristics. |
| KV cache | Stored attention "keys and values" from earlier tokens, so the model doesn't recompute them. It makes long contexts and multi-turn chats efficient. → paged KV |
| TTFT | "Time to first token" — how long before the answer starts streaming. KV-cache reuse reduces it on follow-up turns. |
| Backend | The hardware path that runs the math: CPU, NVIDIA GPU (CUDA), or Apple Silicon GPU (Metal/MLX). → Backends |
| MoE (Mixture of Experts) | A model that routes each token to a few specialized sub-networks ("experts") instead of the whole network, giving large capacity at lower compute per token. |
| Multimodal | Able to take more than text — images, audio, or video — as input. → Multimodal |
| Thinking / reasoning | The model produces hidden step-by-step reasoning before its final answer; TensorSharp separates the two. → Thinking mode |
| Tool / function calling | The model can ask your application to run a function (e.g. "get_weather") and use the result. → Tool calling |
| Continuous batching | Serving many users at once by interleaving their requests on a single hosted model, instead of one at a time. → Deep dive |
| Speculative decoding | A small "draft" head guesses several tokens; the main model verifies them in one pass — faster output, identical result. → MTP |
| Projector (mmproj) | A companion file that lets a multimodal model understand images/audio. Place it next to the model or pass --mmproj. |
| Sampling | How the next token is chosen from the model's probabilities. Temperature, top-p, and top-k control creativity vs. determinism. → Sampling |
Frequently asked questions
Do I need a GPU?
No. TensorSharp runs on plain CPU (--backend cpu or the faster ggml_cpu). A GPU — NVIDIA (CUDA) or Apple Silicon (Metal/MLX) — makes generation substantially faster, especially for larger models. See Backends.
Which model should I start with?
Gemma-4-E4B (Q8_0) is a small, well-tested starting point. For lower memory, choose a Q4_K_M quantization; for higher quality, a larger model or Q8_0. See Model downloads.
Is my data private?
Yes — inference happens entirely on your machine. Prompts, documents, and images are not sent to any external service. That is the main reason organizations choose local inference. See Business value.
Can I keep using my existing OpenAI / Ollama code?
Yes. The server speaks both wire formats. Point an OpenAI client at http://localhost:5000/v1 (any API key) or an Ollama client at http://localhost:5000/api/…. See the HTTP API.
How do I serve many users at once?
Run TensorSharp.Server. Its continuous-batching engine interleaves concurrent requests against one hosted model and shares common prompt prefixes across them, which keeps throughput high. See Continuous batching.
How can I make it faster?
- Use a GPU backend (
ggml_cudaorggml_metal). - Pick a smaller or more aggressively quantized model.
- On supported models, enable speculative decoding with
--mtp-spec. - Reuse sessions so the KV-cache prefix carries over between turns.
What does it cost?
There are no per-token fees — you run open models on hardware you control. The cost is the hardware and the electricity to run it. See Business value.
What platforms are supported?
Windows, Linux, and macOS (Apple Silicon), on .NET 10. Pre-built self-contained binaries are published for Windows/Linux (CPU and CUDA) and macOS arm64. See Getting Started and platform binaries.
How is it licensed?
TensorSharp is authored by Zhongkai Fu and distributed under the license in the repository's LICENSE file (BSD-3-Clause). Each model you download has its own separate license from its publisher — check the model card on Hugging Face.
Where do I report issues or contribute?
On the GitHub repository.