Inference
Quick Answer
Inference is the phase in which a trained model is used to produce predictions or outputs on new data. While training happens once (or periodically), inference happens every time a user interacts with an AI system, making it the dominant cost in most production deployments.
In Depth
What inference really means
Latency (how long a single request takes) and throughput (how many requests are served per unit of time) are critical operational metrics for inference. A model that is 3% more accurate but twice as slow may be the wrong choice for a real-time chatbot yet ideal for an overnight batch job.
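As a rough illustration of how these two metrics can be measured, the sketch below times repeated calls to a model and reports latency and throughput. The `fake_model` function and the `benchmark` helper are hypothetical stand-ins invented for this example; in practice the call would go to a real model or API, and the numbers would depend on hardware, batching and network conditions.

```python
import statistics
import time


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.001)  # simulate a small amount of work per request
    return prompt.upper()


def benchmark(predict, prompts):
    """Call `predict` once per prompt and summarise latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        predict(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        # Requests completed per second over the whole run.
        "throughput_rps": len(latencies) / elapsed,
    }


stats = benchmark(fake_model, ["hello"] * 20)
print(stats)
```

This version sends requests one at a time, so throughput is simply the inverse of average latency. Systems that batch or parallelise requests can raise throughput without lowering per-request latency, which is why the two are usually tracked separately.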
Inference can occur in the cloud, at the edge (on devices), or fully on-premises. Data sovereignty, privacy and cost considerations usually drive this choice for organisations, especially in regulated sectors.
Why It Matters
Business relevance for UK organisations
Inference cost is where AI budgets quietly balloon. A seemingly cheap per-token price becomes expensive once it is multiplied by the tokens in each query and by millions of queries per month. Monitoring and optimising inference is therefore essential for commercial viability.
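The multiplication can be made concrete with a back-of-envelope calculation. The sketch below assumes a simple pricing model in which cost scales linearly with total tokens; the function name and every figure in it (query volume, tokens per query, price) are illustrative assumptions, not real vendor prices.

```python
def monthly_inference_cost(
    queries_per_month: int,
    tokens_per_query: int,
    price_per_million_tokens: float,
) -> float:
    """Estimate monthly spend, assuming cost is linear in total tokens."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 5 million queries a month, 1,500 tokens each,
# at an assumed price of £0.50 per million tokens.
cost = monthly_inference_cost(5_000_000, 1_500, 0.50)
print(f"£{cost:,.2f}")  # £3,750.00
```

Real pricing often distinguishes input tokens from output tokens and may include volume discounts, so an estimate like this is a starting point for monitoring rather than a forecast.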
Real-world example
How this shows up in practice
A Cardiff fintech reduced its monthly AI bill by 62% by caching common inference responses and routing simple queries to a smaller model.
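The example does not say how the caching and routing were implemented. One minimal way to combine the two ideas is sketched below. Everything in it is an assumption for illustration: `small_model` and `large_model` are hypothetical placeholders, and the word-count rule in `is_simple` is a deliberately crude heuristic.

```python
from functools import lru_cache

# Counts how often each (hypothetical) model is actually invoked.
calls = {"small": 0, "large": 0}


def small_model(query: str) -> str:
    """Hypothetical cheap model."""
    calls["small"] += 1
    return f"[small] {query}"


def large_model(query: str) -> str:
    """Hypothetical expensive model."""
    calls["large"] += 1
    return f"[large] {query}"


def is_simple(query: str) -> bool:
    """Assumed heuristic: treat short queries as simple."""
    return len(query.split()) <= 8


@lru_cache(maxsize=1024)
def answer(query: str) -> str:
    """Route to a model; repeated queries are served from the cache."""
    if is_simple(query):
        return small_model(query)
    return large_model(query)


def handle(query: str) -> str:
    """Normalise the query so trivially different phrasings share a cache entry."""
    return answer(query.strip().lower())


first = handle("What is my balance?")
second = handle("what is my balance?  ")  # same normalised query: cache hit
third = handle(
    "Please compare the fees on my last three international transfers "
    "and explain each charge"
)
print(calls, answer.cache_info().hits)
```

An in-memory cache like this only suits responses that are safe to reuse across users and over time. Personalised or fast-changing answers would need cache keys that include the user or an expiry policy, which this sketch omits.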
Related Terms
Continue exploring
Model
A model is the trained output of a machine learning process — a collection of learned parameters that, combined with an algorithm, can turn new inputs into predictions or generated content without being explicitly programmed for each case.
Technical
Token
A token is the basic unit of text that a language model processes. Tokens are usually subword chunks — roughly four characters or three-quarters of a word in English — and both the size of the model's context window and its pricing are typically measured in tokens.
Advanced
MLOps
MLOps is the discipline of operationalising machine learning: the practices, tools and culture needed to deploy, monitor, retrain and govern models reliably in production. It extends DevOps thinking to the unique challenges of data and models.
Technical
Large Language Model (LLM)
A Large Language Model (LLM) is a type of neural network trained on vast quantities of text to understand and generate human language. LLMs power chatbots, copilots, content generators and many modern AI features across consumer and business software.