model_selection

Gini-based stepwise logistic regression for scorecard-style feature selection.

AUCStepwiseLogit

AUCStepwiseLogit(
    initial_predictors: list[str] | None = None,
    all_predictors: list[str] | None = None,
    selection_method: str = "stepwise",
    max_iter: int = 1000,
    min_increase: float = 0.005,
    max_decrease: float = 0.0025,
    max_predictors: int = 0,
    max_correlation: float = 1.0,
    enforce_coef_sign: bool = False,
    penalty: str = "l2",
    C: float = 1000.0,
    correlation_sample: int = 10000,
    use_cv: bool = False,
    cv_folds: int = 5,
    cv_seed: int = 42,
    cv_stratify: bool = True,
)

Bases: BaseEstimator

Gini-based stepwise logistic regression.

Selects features by Gini improvement rather than p-values, with optional correlation filtering, sign enforcement, and cross-validated scoring.
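
The class name refers to AUC while the description speaks of Gini. In scorecard work the two are usually related by the rescaling Gini = 2 · AUC − 1. The sketch below assumes that convention; how the library itself computes the score is not shown on this page, and the helper names `auc` and `gini` are illustrative only.

```python
def auc(y_true, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini as the conventional rescaling of AUC to the range [-1, 1]."""
    return 2.0 * auc(y_true, scores) - 1.0


y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(gini(y, s))  # 3 of 4 pairs are ordered correctly: AUC 0.75, Gini 0.5
```

Under this convention a Gini threshold corresponds to half that amount in AUC, so a `min_increase` of 0.005 would be an AUC gain of 0.0025. The pairwise loop is quadratic and is meant for illustration rather than large data sets.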

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `initial_predictors` | `list[str] \| None` | Features forced into the model at the start. | `None` |
| `all_predictors` | `list[str] \| None` | Candidate pool (defaults to all columns in X). | `None` |
| `selection_method` | `str` | "forward", "backward", or "stepwise". | `'stepwise'` |
| `max_iter` | `int` | Maximum number of add/remove steps. | `1000` |
| `min_increase` | `float` | Minimum Gini gain required to add a feature. | `0.005` |
| `max_decrease` | `float` | Maximum Gini drop allowed before removing a feature. | `0.0025` |
| `max_predictors` | `int` | Hard cap on model size (0 = unlimited). | `0` |
| `max_correlation` | `float` | Reject candidates whose correlation with any already-selected feature exceeds this value (1.0 effectively disables the filter). | `1.0` |
| `enforce_coef_sign` | `bool` | Reject features that flip a coefficient sign. | `False` |
| `penalty` | `str` | Regularisation type passed to LogisticRegression. | `'l2'` |
| `C` | `float` | Inverse regularisation strength, following the scikit-learn `C` convention (larger values mean weaker regularisation). | `1000.0` |
| `correlation_sample` | `int` | Maximum number of rows used for the correlation check. | `10000` |
| `use_cv` | `bool` | Score via k-fold CV instead of a held-out validation set. | `False` |
| `cv_folds` | `int` | Number of CV folds. | `5` |
| `cv_seed` | `int` | Random seed for CV splits. | `42` |
| `cv_stratify` | `bool` | Use stratified folds. | `True` |
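
The way `min_increase`, `max_predictors`, and `max_correlation` might interact can be sketched as a greedy forward pass. This is a minimal illustration under stated assumptions, not the library's implementation: `forward_select`, `pearson`, and the `score` callable are hypothetical names, the scoring function is a toy, and the backward-removal rule governed by `max_decrease` is left out.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(va * vb)


def forward_select(columns, score, min_increase=0.005,
                   max_predictors=0, max_correlation=1.0):
    """Greedy forward selection (illustrative only).

    columns: dict mapping feature name -> list of values
    score:   hypothetical callable, list of names -> Gini of a model on them
    """
    selected = []
    best = score(selected)
    while max_predictors == 0 or len(selected) < max_predictors:
        choice, choice_score = None, best
        for name in columns:
            if name in selected:
                continue
            # Skip candidates too correlated with any selected feature.
            if any(abs(pearson(columns[name], columns[s])) > max_correlation
                   for s in selected):
                continue
            trial = score(selected + [name])
            # Keep the best candidate that clears the min_increase threshold.
            if trial - best >= min_increase and trial > choice_score:
                choice, choice_score = name, trial
        if choice is None:
            break  # nothing clears the threshold
        selected.append(choice)
        best = choice_score
    return selected, best


# Toy data: "b" is almost a copy of "a"; "c" is only weakly related.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
gains = {"a": 0.30, "b": 0.28, "c": 0.05}   # made-up additive Gini gains


def toy_score(names):
    return sum(gains[n] for n in names)


chosen, final_gini = forward_select(columns, toy_score, max_correlation=0.9)
print(chosen)  # "b" is filtered out by its correlation with "a"
```

In this toy run `a` is added first, `b` is then rejected because its correlation with `a` exceeds 0.9, and `c` is added because it still clears `min_increase`. With the default `max_correlation=1.0` the filter never triggers and all three features would be selected.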

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `predictors_` | | Ordered list of selected feature names. |
| `coef_` | | Coefficients for selected features. |
| `intercept_` | | Model intercept. |
| `progress_` | | DataFrame logging each add/remove step with Gini deltas. |

Source code in datasci_toolkit/model_selection.py
def __init__(
    self,
    initial_predictors: list[str] | None = None,
    all_predictors: list[str] | None = None,
    selection_method: str = "stepwise",
    max_iter: int = 1000,
    min_increase: float = 0.005,
    max_decrease: float = 0.0025,
    max_predictors: int = 0,
    max_correlation: float = 1.0,
    enforce_coef_sign: bool = False,
    penalty: str = "l2",
    C: float = 1000.0,
    correlation_sample: int = 10000,
    use_cv: bool = False,
    cv_folds: int = 5,
    cv_seed: int = 42,
    cv_stratify: bool = True,
) -> None:
    self.initial_predictors = initial_predictors
    self.all_predictors = all_predictors
    self.selection_method = selection_method
    self.max_iter = max_iter
    self.min_increase = min_increase
    self.max_decrease = max_decrease
    self.max_predictors = max_predictors
    self.max_correlation = max_correlation
    self.enforce_coef_sign = enforce_coef_sign
    self.penalty = penalty
    self.C = C
    self.correlation_sample = correlation_sample
    self.use_cv = use_cv
    self.cv_folds = cv_folds
    self.cv_seed = cv_seed
    self.cv_stratify = cv_stratify
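
The constructor above only stores its arguments under matching attribute names. That is the usual scikit-learn convention: `BaseEstimator.get_params` reads the parameter names from the `__init__` signature, and cloning rebuilds an estimator from those values. The sketch below imitates that mechanism with the standard library only; `ParamsMixin` and `TinySelector` are hypothetical stand-ins, not scikit-learn or datasci_toolkit code.

```python
import inspect


class ParamsMixin:
    """Toy stand-in for the parameter handling that BaseEstimator provides."""

    def get_params(self):
        # Read parameter names from the __init__ signature, then look up
        # the same-named attributes on the instance.
        names = [
            p for p in inspect.signature(type(self).__init__).parameters
            if p != "self"
        ]
        return {name: getattr(self, name) for name in names}

    def clone(self):
        # Rebuild an unfitted copy from the constructor arguments alone.
        return type(self)(**self.get_params())


class TinySelector(ParamsMixin):
    def __init__(self, min_increase=0.005, max_predictors=0):
        # Store arguments unchanged: no validation, no derived state.
        self.min_increase = min_increase
        self.max_predictors = max_predictors


sel = TinySelector(min_increase=0.01)
twin = sel.clone()
print(twin.get_params())
```

If `__init__` renamed or transformed an argument, the attribute lookup in `get_params` would fail or return a different value, which is why validation and derived state are normally deferred to `fit`.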