datasci-toolkit

Polars-native Python toolkit for binary classification, scorecard development, and model validation.

Modules

Monitoring & Stability

| Module | Classes / Functions | Use case |
| --- | --- | --- |
| stability | PSI, ESI, StabilityMonitor | Detect population drift between training and production -- catch when input distributions shift before model performance degrades |
| metrics | gini, ks, lift, iv, BootstrapGini, feature_power, AUCStability | Evaluate binary classifiers with confidence intervals -- report Gini/KS/lift by month, identify which features drive predictive power, and track model stability over time with a single scalar metric |
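
As a rough illustration of what the stability and metrics modules measure, the sketch below computes PSI from binned counts and Gini/KS from scores in plain Python. It is not the toolkit's implementation: the function names `psi`, `gini_coefficient`, and `ks_statistic`, the `eps` floor, and the brute-force pairwise AUC are assumptions chosen for readability rather than speed.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are per-bin counts over the same bins. `eps` is an
    assumed floor that keeps empty bins from producing log(0).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


def gini_coefficient(y_true, y_score):
    """Gini = 2 * AUC - 1, with AUC computed over all positive/negative pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1


def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of positives and negatives."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(y_score)):
        pos_cdf = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s <= t) / n_pos
        neg_cdf = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s <= t) / n_neg
        best = max(best, abs(pos_cdf - neg_cdf))
    return best


# Toy data: training vs. production bin counts, and four scored cases.
train_bins, prod_bins = [50, 50], [25, 75]
labels, scores = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
print(psi(train_bins, prod_bins))
print(gini_coefficient(labels, scores), ks_statistic(labels, scores))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift, though the right cutoffs depend on the portfolio.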

Feature Engineering & Selection

| Module | Classes / Functions | Use case |
| --- | --- | --- |
| grouping | StabilityGrouping, WOETransformer | Bin continuous features into stable WOE-encoded groups for scorecard development -- ensures bins don't drift across time periods |
| feature_elimination | ShapImportance, ShapRFE | Reduce a 500-feature dataset to the 20 that matter -- backward elimination using SHAP values with cross-validation |
| variable_clustering | CorrVarClus | Remove redundant features before modeling -- hierarchical clustering picks one representative from each correlated group |
| temporal | TemporalFeatureEngineer | Generate time-windowed aggregations (sum/mean/max over 30d/90d/1y) from transaction histories for credit scoring or churn prediction |
| tagging | WeightedTFIDF | Profile entities with ranked tags -- find top product attributes from reviews, build customer interest profiles from transactions, with external quality signals |
| interactions | BinnedInteractionEncoder, QuantileBinner, OptimalBinner, PrecomputedBinner, OptimalBinning2D, ContinuousOptimalBinning2D | Discover and encode pairwise feature interactions -- bin two features independently, combine into joint index, WOE-encode to capture non-linear effects like age-x-income risk profiles |
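
Several of these modules rely on weight-of-evidence (WOE) encoding. A minimal sketch of the underlying arithmetic, assuming per-bin counts of goods and bads are already available, might look like the following. The function name `woe_and_iv`, the additive `smoothing` term, and the sign convention (log of good share over bad share) are assumptions for illustration; the toolkit's own conventions may differ.

```python
import math


def woe_and_iv(goods, bads, smoothing=0.5):
    """Per-bin weight of evidence and total information value.

    `goods` and `bads` are counts per bin. `smoothing` is an assumed
    additive constant that keeps empty cells from producing log(0).
    """
    n_bins = len(goods)
    g_total = sum(goods) + smoothing * n_bins
    b_total = sum(bads) + smoothing * n_bins
    woe = []
    iv = 0.0
    for g, b in zip(goods, bads):
        g_share = (g + smoothing) / g_total
        b_share = (b + smoothing) / b_total
        w = math.log(g_share / b_share)  # sign convention: good over bad
        woe.append(w)
        iv += (g_share - b_share) * w
    return woe, iv


# Two bins with mirrored good/bad counts give symmetric WOE values.
woe, iv = woe_and_iv(goods=[80, 20], bads=[20, 80], smoothing=0.0)
print(woe, iv)
```

Replacing each raw value with the WOE of its bin gives a monotone, log-odds-scaled input, which is why the encoding pairs naturally with logistic regression scorecards.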

Model Building & Post-processing

| Module | Classes / Functions | Use case |
| --- | --- | --- |
| model_selection | AUCStepwiseLogit | Build interpretable scorecards -- stepwise logistic regression that adds features by Gini lift and enforces correlation constraints |
| bin_editor | BinEditor, BinEditorWidget | Manually adjust bin boundaries after auto-binning -- headless API for pipelines, interactive widget for notebooks |
| label_imputation | KNNLabelImputer, TargetImputer | Recover labels for rejected loan applications (reject inference) or fill missing targets in semi-supervised settings |
| smoothing | PoissonSmoother, PredictionSmoother | Stabilize noisy count features before modeling (Poisson), or eliminate monthly prediction jitter so customers don't flip between risk tiers |
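
To make the idea of a correlation constraint concrete, here is a deliberately simplified sketch: candidates are ranked by a precomputed univariate score and admitted greedily unless they correlate too strongly with a feature already chosen. This is a plain filter, not stepwise logistic regression, and it does not refit a model or measure Gini lift at each step. The names `pearson` and `select_with_correlation_cap` are hypothetical; only the parameter names `max_predictors` and `max_correlation` echo the quick start.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def select_with_correlation_cap(columns, scores, max_predictors, max_correlation):
    """Greedy selection by score, skipping features too correlated with picks.

    `columns` maps feature name -> values; `scores` maps feature name -> a
    univariate quality score (for example a Gini computed elsewhere).
    """
    ranked = sorted(columns, key=lambda name: scores[name], reverse=True)
    selected = []
    for name in ranked:
        if len(selected) >= max_predictors:
            break
        if all(
            abs(pearson(columns[name], columns[kept])) <= max_correlation
            for kept in selected
        ):
            selected.append(name)
    return selected


# "b" is nearly a copy of "a", so it is skipped despite its high score.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
scores = {"a": 0.60, "b": 0.55, "c": 0.30}
print(select_with_correlation_cap(columns, scores, max_predictors=10, max_correlation=0.8))
```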

Install

pip install datasci-toolkit

Quick start

import polars as pl
from lightgbm import LGBMClassifier
from datasci_toolkit import ShapRFE, StabilityGrouping, AUCStepwiseLogit

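# Inputs (X_train, X_val, y_*, month_*) are assumed to be defined already.
# 1. Stability-constrained binning
sg = StabilityGrouping(stability_threshold=0.1).fit(
    X_train, y_train, t_train=month_train,
    X_val=X_val, y_val=y_val, t_val=month_val,
)
# Transform the same splits that the later steps fit and validate on.
X_woe = sg.transform(X_train)
X_val_woe = sg.transform(X_val)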

# 2. SHAP-based feature elimination
rfe = ShapRFE(
    model=LGBMClassifier(n_estimators=100, verbose=-1),
    step=1, cv=5, min_features_to_select=5,
).fit(X_woe, y_train)
features = rfe.get_reduced_features("best_parsimonious")

# 3. Stepwise logistic regression
model = AUCStepwiseLogit(max_predictors=10, max_correlation=0.8).fit(
    X_woe.select(features), y_train,
    X_val=X_val_woe.select(features), y_val=y_val,
)

Stack

  • Python 3.12, polars -- no pandas
  • scikit-learn estimator conventions (fit / transform / score)
  • shap + lightgbm + xgboost for SHAP-based feature selection
  • matplotlib for standalone plot functions