Inference

Quick Answer

Inference is the phase in which a trained model is used to produce predictions or outputs on new data. While training happens once (or periodically), inference happens every time a user interacts with an AI system, making it the dominant cost in most production deployments.

In Depth

What inference really means

Latency (how long a single request takes) and throughput (how many requests are served per unit of time) are the critical operational metrics for inference. A model that is 3% more accurate but twice as slow may be the wrong choice for a real-time chatbot, yet well suited to an overnight batch job.
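As a rough illustration of how these two metrics might be measured, the sketch below times repeated calls to a model and reports median latency, 95th-percentile latency and throughput. The `measure` function and the `predict` callable are hypothetical stand-ins rather than part of any particular library, and a production benchmark would also need to account for warm-up, concurrency and network time.

```python
import statistics
import time


def measure(predict, inputs):
    """Time each call to `predict` and summarise latency and throughput.

    `predict` is a hypothetical stand-in for a real inference call.
    `inputs` must be a sequence containing at least two items.
    """
    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        predict(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    return {
        "median_latency_s": statistics.median(latencies),
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
        "throughput_per_s": len(inputs) / elapsed,
    }
```

Comparing the output for two candidate models on the same inputs gives a more grounded basis for the accuracy-versus-speed trade-off than accuracy figures alone.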

Inference can occur in the cloud, at the edge (on devices), or fully on-premises. Data sovereignty, privacy and cost considerations usually drive this choice for organisations, especially in regulated sectors.

Why It Matters

Business relevance for UK organisations

Inference cost is where AI budgets quietly balloon. A seemingly cheap per-token price becomes expensive when multiplied by the tokens in each query and then by millions of queries per month. Monitoring and optimising inference is essential for commercial viability.
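The arithmetic is simple enough to sketch. The function name and every figure below are hypothetical, chosen only to show how a small unit price scales with volume; real pricing usually distinguishes input and output tokens and varies by provider.

```python
def monthly_inference_cost(queries_per_month, tokens_per_query, price_per_million_tokens):
    """Estimate monthly spend, in the same currency as the token price."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 5 million queries a month, 1,500 tokens per query,
# priced at 0.50 (in pounds) per million tokens.
cost = monthly_inference_cost(5_000_000, 1_500, 0.50)
print(f"GBP {cost:,.2f}")  # GBP 3,750.00
```

Doubling either the query volume or the average tokens per query doubles the bill, which is why prompt length and traffic growth are both worth monitoring.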

Real-world example

How this shows up in practice

A Cardiff fintech reduced its monthly AI bill by 62% by caching common inference responses and routing simple queries to a smaller model.