datasci-toolkit

Polars-native Python toolkit for binary classification, scorecard development, and model validation.

Modules

Monitoring & Stability

Module	Classes / Functions	Use case
`stability`	`PSI`, `ESI`, `StabilityMonitor`	Detect population drift between training and production -- catch when input distributions shift before model performance degrades
`metrics`	`gini`, `ks`, `lift`, `iv`, `BootstrapGini`, `feature_power`, `AUCStability`	Evaluate binary classifiers with confidence intervals -- report Gini/KS/lift by month, identify which features drive predictive power, and track model stability over time with a single scalar metric
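
As a rough illustration of the drift check that `PSI` is named after, the sketch below computes the standard Population Stability Index from pre-binned counts using only the standard library. The function name `psi_from_counts`, its `eps` floor, and the sample counts are assumptions made for this example. They are not the toolkit's API, and the `PSI` class may bin and smooth differently.

```python
import math


def psi_from_counts(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two samples binned on the same edges.

    PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must be binned on the same edges")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each share at eps so an empty bin does not produce log(0).
        e_share = max(e / expected_total, eps)
        a_share = max(a / actual_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical bin counts for one feature: training sample vs. two production samples.
train_bins = [100, 200, 400, 200, 100]
prod_same = [10, 20, 40, 20, 10]          # same shape, smaller sample
prod_shifted = [50, 150, 300, 300, 200]   # mass moved toward the upper bins

print(psi_from_counts(train_bins, prod_same))
print(psi_from_counts(train_bins, prod_shifted))
```

Identical distributions give a PSI of 0. A commonly quoted rule of thumb treats values above roughly 0.1 as worth investigating and values above 0.25 as a significant shift.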

Feature Engineering & Selection

Module	Classes / Functions	Use case
`grouping`	`StabilityGrouping`, `WOETransformer`	Bin continuous features into stable WOE-encoded groups for scorecard development -- ensures bins don't drift across time periods
`feature_elimination`	`ShapImportance`, `ShapRFE`	Reduce a 500-feature dataset to the 20 that matter -- backward elimination using SHAP values with cross-validation
`variable_clustering`	`CorrVarClus`	Remove redundant features before modeling -- hierarchical clustering picks one representative from each correlated group
`temporal`	`TemporalFeatureEngineer`	Generate time-windowed aggregations (sum/mean/max over 30d/90d/1y) from transaction histories for credit scoring or churn prediction
`tagging`	`WeightedTFIDF`	Profile entities with ranked tags, weighted by external quality signals -- find top product attributes from reviews or build customer interest profiles from transactions
`interactions`	`BinnedInteractionEncoder`, `QuantileBinner`, `OptimalBinner`, `PrecomputedBinner`, `OptimalBinning2D`, `ContinuousOptimalBinning2D`	Discover and encode pairwise feature interactions -- bin two features independently, combine them into a joint index, then WOE-encode that index to capture non-linear effects such as age-x-income risk profiles
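
Several modules above rely on WOE encoding. The sketch below shows the textbook calculation of weight of evidence per bin and the resulting information value, using only the standard library. The helper name `woe_and_iv`, the sign convention (log of good share over bad share), and the counts are assumptions for illustration. The toolkit's `WOETransformer` may use the opposite sign or apply smoothing.

```python
import math


def woe_and_iv(bins):
    """Compute WOE per bin and total IV.

    bins: list of (n_good, n_bad) tuples, one per bin.
    Assumes every bin has at least one good and one bad record;
    real implementations usually smooth empty cells.
    """
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    woe_values = []
    iv = 0.0
    for n_good, n_bad in bins:
        good_share = n_good / good_total
        bad_share = n_bad / bad_total
        # Assumed convention: positive WOE means the bin is safer than average.
        woe = math.log(good_share / bad_share)
        woe_values.append(woe)
        iv += (good_share - bad_share) * woe
    return woe_values, iv


# Hypothetical counts for a feature split into three bins.
example_bins = [(400, 100), (300, 100), (300, 300)]
woe_values, iv = woe_and_iv(example_bins)
print(woe_values)
print(iv)
```

The same calculation applies to a joint index built from two binned features: each combination of bins is treated as one bin in the list.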

Model Building & Post-processing

Module	Classes / Functions	Use case
`model_selection`	`AUCStepwiseLogit`	Build interpretable scorecards -- stepwise logistic regression that adds features by Gini lift and enforces correlation constraints
`bin_editor`	`BinEditor`, `BinEditorWidget`	Manually adjust bin boundaries after auto-binning -- headless API for pipelines, interactive widget for notebooks
`label_imputation`	`KNNLabelImputer`, `TargetImputer`	Recover labels for rejected loan applications (reject inference) or fill missing targets in semi-supervised settings
`smoothing`	`PoissonSmoother`, `PredictionSmoother`	Stabilize noisy count features before modeling (Poisson), or eliminate monthly prediction jitter so customers don't flip between risk tiers
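
To make "Gini lift" concrete, the sketch below computes Gini as 2 * AUC - 1 from labels and scores, then compares two score vectors. This is the usual credit-scoring definition, but whether `AUCStepwiseLogit` measures lift exactly this way is an assumption. The function name `gini_from_scores` and the toy data are hypothetical.

```python
def gini_from_scores(y_true, scores):
    """Gini coefficient as 2 * AUC - 1, with AUC computed by pairwise comparison.

    AUC is the share of (positive, negative) pairs in which the positive
    record has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    auc = wins / (len(pos) * len(neg))
    return 2.0 * auc - 1.0


# Toy labels and two hypothetical score vectors: before and after adding a feature.
y = [0, 0, 1, 1]
scores_base = [0.10, 0.40, 0.35, 0.80]
scores_with_feature = [0.10, 0.30, 0.35, 0.80]

gini_base = gini_from_scores(y, scores_base)
gini_new = gini_from_scores(y, scores_with_feature)
gini_lift = gini_new - gini_base
print(gini_base, gini_new, gini_lift)
```

The pairwise loop is quadratic in sample size and is meant only to show the definition; a rank-based AUC is the practical choice on real data.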

Install

pip install datasci-toolkit

Quick start

import polars as pl
from lightgbm import LGBMClassifier
from datasci_toolkit import ShapRFE, StabilityGrouping, AUCStepwiseLogit

# 1. Stability-constrained binning
sg = StabilityGrouping(stability_threshold=0.1).fit(
    X_train, y_train, t_train=month_train,
    X_val=X_val, y_val=y_val, t_val=month_val,
)
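# Transform the same rows that y_train / y_val describe, so shapes line up below
X_woe = sg.transform(X_train)
X_val_woe = sg.transform(X_val)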

# 2. SHAP-based feature elimination
rfe = ShapRFE(
    model=LGBMClassifier(n_estimators=100, verbose=-1),
    step=1, cv=5, min_features_to_select=5,
).fit(X_woe, y_train)
features = rfe.get_reduced_features("best_parsimonious")

# 3. Stepwise logistic regression
model = AUCStepwiseLogit(max_predictors=10, max_correlation=0.8).fit(
    X_woe.select(features), y_train,
    X_val=X_val_woe.select(features), y_val=y_val,
)

Stack

Python 3.12, polars -- no pandas
scikit-learn estimator conventions (fit / transform / score)
shap + lightgbm + xgboost for SHAP-based feature selection
matplotlib for standalone plot functions