
Training Data

Quick Answer

Training data is the dataset used to teach a machine learning model the patterns it needs to perform its task. The quality, quantity, diversity and recency of training data directly determine how accurate and fair the resulting model will be.

In Depth

What Training Data really means

Training data usually consists of input examples and, for supervised learning, the correct output labels. Preparing training data typically involves collection, cleaning, de-duplication, labelling and splitting into training, validation and test sets.
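
The cleaning, de-duplication and splitting steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the function name prepare_splits, the (text, label) record shape, the 10% validation and test fractions, and the case-insensitive exact-match de-duplication are all assumptions made for the example. Collection and labelling are taken as already done.

```python
import random


def prepare_splits(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Clean, de-duplicate and split (text, label) pairs.

    Hypothetical helper for illustration; returns (train, val, test) lists.
    """
    # Cleaning: collapse runs of whitespace and drop examples with empty text.
    cleaned = [(" ".join(text.split()), label) for text, label in examples]
    cleaned = [(text, label) for text, label in cleaned if text]

    # De-duplication: keep the first occurrence of each text, ignoring case.
    seen = set()
    unique = []
    for text, label in cleaned:
        key = text.lower()
        if key not in seen:
            seen.add(key)
            unique.append((text, label))

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Splitting: carve off the test and validation sets, train on the rest.
    n_test = int(len(unique) * test_frac)
    n_val = int(len(unique) * val_frac)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test
```

De-duplicating before splitting matters: if the same example lands in both the training and test sets, the test score overstates how well the model will handle genuinely new data.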

Poor-quality data is one of the most common causes of disappointing AI projects. Biased data produces biased models; stale data produces models that fail in production. Investing in data preparation usually pays off more than chasing exotic algorithms.

Why It Matters

Business relevance for UK organisations

UK businesses must consider UK GDPR when using personal data for training. Lawful basis, purpose limitation, data minimisation and the right to erasure all shape what data can be used and how.

Real-world example

How this shows up in practice

A Bristol HR-tech vendor discovered its CV-screening model underperformed for candidates from non-Russell Group universities because its training data over-represented a narrow set of employers.
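
One way to surface a gap like this is to report accuracy per group instead of a single overall figure. The sketch below assumes each evaluation record carries a group tag alongside the true and predicted labels; the helper name accuracy_by_group and the toy records are invented for illustration and are not the vendor's data or method.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for an iterable of (group, y_true, y_pred).

    Hypothetical helper for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Toy evaluation records (made-up values): (group, true label, predicted label).
toy_records = [
    ("group_a", 1, 1),
    ("group_a", 0, 0),
    ("group_b", 1, 0),
    ("group_b", 0, 0),
]

for group, accuracy in accuracy_by_group(toy_records).items():
    print(f"{group}: {accuracy:.2f}")
```

A large difference between groups is a prompt to check how well each group is represented in the training data, not proof of the cause on its own.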