Model Selection — AUCStepwiseLogit
Stepwise logistic regression that selects features by Gini improvement rather than p-values.
Why Gini instead of p-values?
P-value stepwise selection optimises for statistical inference. Gini-based selection directly maximises predictive performance — which is what you actually care about in production models.
Additional constraints for regulated environments:
max_correlation— rejects candidates correlated with already-selected featuresenforce_coef_sign— rejects features whose coefficient flips sign when added (scorecard monotonicity)use_cv— evaluate on cross-validation folds instead of a fixed validation set
Basic usage
import numpy as np
import polars as pl
from datasci_toolkit import AUCStepwiseLogit
rng = np.random.default_rng(0)
N = 2000
f0 = rng.normal(0, 1, N)
f1 = rng.normal(0, 1, N)
f2 = rng.normal(0, 1, N)
f0_clone = f0 + rng.normal(0, 0.1, N) # nearly identical to f0
noise = rng.normal(0, 1, N)
logit = 0.8 * f0 + 0.5 * f1 + 0.3 * f2
y = (1 / (1 + np.exp(-logit)) > rng.uniform(size=N)).astype(float)
X_train = pl.DataFrame({
"f0": f0[:1500].tolist(), "f1": f1[:1500].tolist(),
"f2": f2[:1500].tolist(), "f0_clone": f0_clone[:1500].tolist(),
"noise": noise[:1500].tolist(),
})
X_val = pl.DataFrame({
"f0": f0[1500:].tolist(), "f1": f1[1500:].tolist(),
"f2": f2[1500:].tolist(), "f0_clone": f0_clone[1500:].tolist(),
"noise": noise[1500:].tolist(),
})
y_train = pl.Series(y[:1500].tolist())
y_val = pl.Series(y[1500:].tolist())
model = AUCStepwiseLogit(
selection_method="stepwise",
min_increase=0.002,
max_correlation=0.8,
max_predictors=5,
).fit(X_train, y_train, X_val=X_val, y_val=y_val)
print("Selected:", model.predictors_)
print("Validation Gini:", model.score(X_val, y_val))
f0_clone will be rejected because it is correlated > 0.8 with f0.
Inspecting selection progress
progress_ is a DataFrame logging every add/remove step.
# Rows where addrm == 0 = feature was added
print(model.progress_.filter(model.progress_["addrm"] == 0))
Coefficients
print(pl.DataFrame({
"predictor": model.predictors_,
"coefficient": [round(float(c), 4) for c in model.coef_],
}))
Cross-validated selection
Pass use_cv=True to score via k-fold CV on the training set — avoids overfitting the selection to the validation set.
model_cv = AUCStepwiseLogit(
selection_method="forward",
min_increase=0.002,
max_correlation=0.8,
use_cv=True,
cv_folds=5,
cv_seed=42,
).fit(X_train, y_train)
print("CV-selected:", model_cv.predictors_)
Key parameters
| Parameter | Default | Description |
|---|---|---|
selection_method |
"stepwise" |
"forward", "backward", or "stepwise" |
min_increase |
0.005 |
Minimum Gini gain required to add a feature |
max_decrease |
0.0025 |
Maximum Gini drop allowed before removing a feature (stepwise) |
max_predictors |
0 |
Hard cap on number of features (0 = unlimited) |
max_correlation |
1.0 |
Reject candidates correlated above this threshold |
enforce_coef_sign |
False |
Reject features that flip coefficient sign |
use_cv |
False |
Score via cross-validation instead of validation set |
cv_folds |
5 |
Number of CV folds |