User Guide
This guide covers the stats-based workflow for leak-free label normalization in machine learning.
⚠️ Critical: Always Use the Stats-Based Workflow
NEVER use normalize_labels() directly on your full dataset. This causes data leakage! Instead, follow this pattern:
- Compute stats from training data ONLY
- Apply the same stats to validation/test data
- Denormalize predictions using the same stats
The Three-Step Pattern
Step 1: Compute Normalization Statistics (Training Data Only)
using RealLabelNormalization
# Your training labels (with outliers)
train_labels = [1.2, 5.8, 3.4, 8.1, 2.3, 100.5] # 100.5 is an outlier
# Compute stats from training data ONLY
stats = compute_normalization_stats(train_labels; method=:zscore, clip_quantiles=(0.01, 0.99))
Step 2: Apply Stats to All Data Splits
# Apply SAME stats to training data
train_normalized = apply_normalization(train_labels, stats)
# Apply SAME stats to validation data
val_labels = [2.1, 4.3, 6.7, 9.2]
val_normalized = apply_normalization(val_labels, stats)
# Apply SAME stats to test data
test_labels = [1.8, 5.1, 7.3]
test_normalized = apply_normalization(test_labels, stats)
Step 3: Denormalize Predictions
# After training your model on normalized data...
predictions_normalized = model(X_test) # Model outputs normalized predictions
# Convert back to original scale using SAME stats
predictions_original = denormalize_labels(predictions_normalized, stats)
Multi-Target Regression (Stats-Based)
For multiple target variables, follow the same three-step pattern:
# Training data: each column is a different target
train_labels = [1.0 10.0 100.0;
                5.0 20.0 200.0;
                3.0 15.0 150.0;
                8.0 25.0 250.0]
# Step 1: Compute stats from training data ONLY
stats = compute_normalization_stats(train_labels; mode=:columnwise, method=:zscore)
# Step 2: Apply SAME stats to all splits
train_normalized = apply_normalization(train_labels, stats)
val_normalized = apply_normalization(val_labels, stats)   # Same stats (val_labels: a matrix with the same target columns)
test_normalized = apply_normalization(test_labels, stats) # Same stats
# Step 3: Denormalize predictions using SAME stats
predictions_original = denormalize_labels(predictions_normalized, stats)
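Under the hood, column-wise z-scoring standardizes each column with its own mean and standard deviation. Here is a plain-Julia sketch of the equivalent math (illustration only; the package's stats object may store more than this):
using Statistics
A = [1.0 10.0 100.0;
     5.0 20.0 200.0;
     3.0 15.0 150.0;
     8.0 25.0 250.0]
μ = mean(A; dims=1)     # 1×3 row of per-column means
σ = std(A; dims=1)      # 1×3 row of per-column standard deviations
A_norm = (A .- μ) ./ σ  # each column standardized independently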
Normalization Methods (Stats-Based)
Min-Max Normalization
Scales values to a specified range (default: [-1, 1]):
# Step 1: Compute stats from training data
stats = compute_normalization_stats(train_labels; method=:minmax, range=(-1, 1))
# Step 2: Apply to all splits
train_norm = apply_normalization(train_labels, stats)
test_norm = apply_normalization(test_labels, stats)
# Step 3: Denormalize predictions
predictions_original = denormalize_labels(predictions_normalized, stats)
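For intuition, min-max scaling to a range (lo, hi) is a simple affine map, and denormalization is its inverse. A plain-Julia sketch (the package presumably derives the min/max from the training stats, after any clipping):
lo, hi = -1.0, 1.0
xmin, xmax = minimum(train_labels), maximum(train_labels)
scale(x)   = lo + (x - xmin) * (hi - lo) / (xmax - xmin)  # forward map into [lo, hi]
unscale(y) = xmin + (y - lo) * (xmax - xmin) / (hi - lo)  # inverse map back to the original scale
train_scaled = scale.(train_labels)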
Z-Score Normalization
Standardizes values to have zero mean and unit variance:
# Step 1: Compute stats from training data
stats = compute_normalization_stats(train_labels; method=:zscore)
# Step 2: Apply to all splits
train_norm = apply_normalization(train_labels, stats)
test_norm = apply_normalization(test_labels, stats)
# Step 3: Denormalize predictions
predictions_original = denormalize_labels(predictions_normalized, stats)
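Written out in plain Julia, the transform and its inverse look like this (apply_normalization and denormalize_labels presumably compute the equivalent via the stats object):
using Statistics
μ, σ = mean(train_labels), std(train_labels)
z = (train_labels .- μ) ./ σ  # normalize: zero mean, unit variance
x = z .* σ .+ μ               # denormalize: recovers the original scale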
Outlier Handling (Stats-Based)
Quantile-Based Clipping
Configure clipping when computing stats from training data:
# Step 1: Compute stats with outlier clipping (training data only)
stats = compute_normalization_stats(train_labels;
    method=:zscore,
    clip_quantiles=(0.01, 0.99)  # Default: clip to 1st-99th percentiles
)
# Step 2: Apply to all splits (same clipping applied)
train_norm = apply_normalization(train_labels, stats)
test_norm = apply_normalization(test_labels, stats)
# Step 3: Denormalize predictions
predictions_original = denormalize_labels(predictions_normalized, stats)
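Conceptually, quantile clipping pins training values to the chosen percentile bounds before the stats are computed, and those bounds travel with the stats. A plain-Julia sketch of the idea (the actual implementation may differ):
using Statistics
lo_q = quantile(train_labels, 0.01)  # 1st percentile of training data
hi_q = quantile(train_labels, 0.99)  # 99th percentile of training data
clipped = clamp.(train_labels, lo_q, hi_q)
# The same (lo_q, hi_q) bounds are then applied when normalizing val/test splits.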
Why Clip Outliers?
Outliers can severely distort normalization, especially min-max scaling:
train_labels = [1, 2, 3, 4, 5, 1000] # 1000 is an outlier
# Step 1: Compute stats with clipping
stats_with_clip = compute_normalization_stats(train_labels; clip_quantiles=(0.1, 0.9))
# Step 2: Apply to test data
test_labels = [1.5, 2.5, 3.5]
test_norm = apply_normalization(test_labels, stats_with_clip)
# Result: test values land in a sensible range because the outlier (1000) was clipped before the stats were computed
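You can see the distortion directly: without clipping, the outlier claims almost the entire min-max range (plain-Julia arithmetic, for illustration):
x = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
raw = (x .- minimum(x)) ./ (maximum(x) - minimum(x))
# raw ≈ [0.0, 0.001, 0.002, 0.003, 0.004, 1.0]: the real data is squashed near 0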
Handling Missing Data (Stats-Based)
NaN values are handled gracefully in the stats-based workflow:
# Training data with missing values
train_with_nan = [1.0, 2.0, NaN, 4.0, 5.0, 100.0]
# Step 1: Compute stats from valid training data only
stats = compute_normalization_stats(train_with_nan) # Uses [1.0, 2.0, 4.0, 5.0, 100.0]
# Step 2: Apply to all splits (NaN positions preserved)
train_norm = apply_normalization(train_with_nan, stats) # NaNs preserved
test_with_nan = [2.0, NaN, 6.0]                        # example test data with a missing value
test_norm = apply_normalization(test_with_nan, stats)  # NaNs preserved, same stats
# Step 3: Denormalize predictions
predictions_original = denormalize_labels(predictions_normalized, stats)
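A plain-Julia sketch of how NaN-aware stats can be computed (illustration only; the package's exact handling may differ):
using Statistics
valid = filter(!isnan, train_with_nan)  # drop NaNs before computing stats
μ, σ = mean(valid), std(valid)
z = (train_with_nan .- μ) ./ σ          # NaN inputs propagate to NaN outputs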
Complete Machine Learning Workflow
Here's the complete pattern for any ML project:
# Step 1: Compute normalization statistics on training data ONLY
train_labels = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0] # With outlier
stats = compute_normalization_stats(train_labels; method=:zscore, clip_quantiles=(0.01, 0.99))
# Step 2: Apply SAME stats to all data splits
train_normalized = apply_normalization(train_labels, stats)
val_normalized = apply_normalization(val_labels, stats) # Same stats
test_normalized = apply_normalization(test_labels, stats) # Same stats
# Step 3: Train model on normalized data
# model = train_model(X_train, train_normalized)
# Step 4: Make predictions and denormalize using SAME stats
predictions_normalized = model(X_test)
predictions_original = denormalize_labels(predictions_normalized, stats)
Best Practices (Stats-Based Workflow)
When to Use Each Method
- Min-max normalization: When you know the expected range of your data or want bounded outputs
- Z-score normalization: When your data is approximately normally distributed
- Global mode: When all targets should be on the same scale (e.g., related measurements)
- Column-wise mode: When targets represent different quantities with different scales
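To make the last two bullets concrete, here is the difference between the two modes sketched with plain Statistics calls (assuming the package selects global mode via the same mode keyword shown with mode=:columnwise above):
using Statistics
A = [1.0 100.0; 2.0 200.0; 3.0 300.0]
# Global: one mean/std over every entry; columns keep their relative scale
g = (A .- mean(A)) ./ std(A)
# Column-wise: each column gets its own mean/std; scales are equalized
c = (A .- mean(A; dims=1)) ./ std(A; dims=1)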
The Golden Rule: Always Use the Stats-Based Workflow
# ✅ CORRECT: Stats-based workflow (prevents data leakage)
stats = compute_normalization_stats(train_labels) # Training data only
train_norm = apply_normalization(train_labels, stats)
test_norm = apply_normalization(test_labels, stats) # Same stats
predictions_original = denormalize_labels(predictions_normalized, stats)
# ❌ WRONG: Direct normalization (causes data leakage)
# train_norm = normalize_labels(train_labels)
# test_norm = normalize_labels(test_labels) # Different stats!
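To see why the wrong version leaks, compare the stats each split would get if normalized independently (plain-Julia illustration):
using Statistics
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test  = [10.0, 11.0, 12.0]
mean(train), std(train)  # (3.0, ≈ 1.58)
mean(test), std(test)    # (11.0, 1.0)
# With per-split stats, train's 3.0 and test's 11.0 both map to 0.0,
# so the normalized splits are no longer on a comparable scale.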
Cross-Validation with Consistent Stats
# For each CV fold, compute stats on training portion only
for fold in 1:5
    train_idx, val_idx = get_cv_indices(fold)
    # Step 1: Compute stats on training fold only
    fold_stats = compute_normalization_stats(y_train[train_idx])
    # Step 2: Apply to both training and validation portions
    y_train_norm = apply_normalization(y_train[train_idx], fold_stats)
    y_val_norm = apply_normalization(y_train[val_idx], fold_stats) # Same stats!
    # Step 3: Train and validate model
    model = train_model(X_train[train_idx], y_train_norm)
    val_pred_norm = model(X_train[val_idx])
    val_pred_original = denormalize_labels(val_pred_norm, fold_stats)
end
Handling Extreme Outliers
Configure clipping when computing stats from training data:
# For data with extreme outliers (e.g., financial data)
stats = compute_normalization_stats(train_labels; clip_quantiles=(0.1, 0.9))
# For very clean data, you might skip clipping
stats = compute_normalization_stats(train_labels; clip_quantiles=nothing)
# Apply same clipping to all splits
train_norm = apply_normalization(train_labels, stats)
test_norm = apply_normalization(test_labels, stats)