Data Leakage in ML and DL: Understanding, Detecting, and Fixing
Feb 15, 2025
Data leakage occurs when a model accesses information it should not see during training. This inflates its measured performance, making it look overly optimistic on test data while failing to generalize in production. In simple terms, the model unknowingly cheats and produces misleading results.
When does this happen?
Data leakage can occur at several stages:
- During data preprocessing: some transformations are applied before the train/test split, such as a scaler whose statistics are computed on the entire dataset.
- In feature selection: if we select the variables most correlated with the target on the full dataset before splitting into train/test, we introduce leakage.
- In labels: if a predictive variable contains a delayed version of the label (for example, a transaction history that already includes an identified fraud flag), the model learns a trivial rule rather than a true underlying relationship.
- In temporal data: if we test on dates later than the training ones, but some features are computed using future data, the model accesses information that would be impossible to obtain in real conditions.
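The first case, preprocessing leakage, can be sketched with scikit-learn's StandardScaler (a minimal illustration with synthetic data, not the only way it appears). The fix is always the same: split first, fit the transformer on the training set only, then apply it to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # synthetic features

# Leaky: the scaler's mean and std are computed on the FULL dataset,
# so test-set statistics bleed into the training data.
X_scaled_leaky = StandardScaler().fit_transform(X)
X_train_leaky, X_test_leaky = train_test_split(X_scaled_leaky, random_state=0)

# Correct: split first, fit the scaler on the training set only,
# then reuse those same training statistics on the test set.
X_train, X_test = train_test_split(X, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

With the correct version, the test set is transformed exactly as unseen production data would be: using statistics the model could actually have known at training time.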
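The second case, feature-selection leakage, is easy to demonstrate on pure noise. Below is a sketch using SelectKBest on random features with no real signal: selecting on the full dataset before cross-validation inflates the score, while wrapping the selector in a pipeline (so it is refit inside each fold) keeps the estimate honest, near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Pure noise: none of the 500 features truly predicts y.
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# Leaky: the 10 "best" features are chosen using ALL labels,
# including those of the future validation folds.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct: selection happens inside the pipeline, refit on each
# training fold, so validation labels are never consulted.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky: {leaky_score:.2f}, honest: {honest_score:.2f}")
```

On noise data the honest score hovers around 0.5 while the leaky one looks misleadingly good, which is exactly the overly optimistic test performance described above.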
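The third case, label leakage, often hides in a single column. The sketch below uses a hypothetical fraud table with a made-up `chargeback_filed` column that is only set after an analyst has already identified the fraud, so it is effectively a delayed copy of the label and must be dropped from the feature matrix.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 250.0, 40.0],
    "n_prior_tx": [5, 1, 12],
    # Hypothetical column: filled in only AFTER the fraud was
    # identified, so it mirrors the label rather than predicting it.
    "chargeback_filed": [0, 1, 0],
    "is_fraud": [0, 1, 0],  # target
})

# Leaky feature matrix: 'chargeback_filed' gives the answer away.
X_leaky = df.drop(columns=["is_fraud"])

# Correct: drop every column whose value is only known after the
# outcome the model is supposed to predict.
X = df.drop(columns=["is_fraud", "chargeback_filed"])
y = df["is_fraud"]
```

A useful habit is to ask, for each feature: "would this value exist at the moment the prediction is made in production?" If not, it leaks the label.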
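For the fourth case, temporal leakage, one common safeguard is to validate with a time-aware splitter instead of a random one. A minimal sketch with scikit-learn's TimeSeriesSplit, assuming the rows are already in chronological order:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 observations in chronological order

# TimeSeriesSplit guarantees that every training index precedes every
# test index, so per-fold features can never be computed from future rows.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()
    print(f"train up to t={train_idx.max()}, "
          f"test t={test_idx.min()}..{test_idx.max()}")
```

This only protects the split itself: rolling statistics, lags, and other engineered features must still be computed using data available strictly before each prediction date.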