label_imputation

Imputation strategies for missing labels — useful when part of the population has unknown outcomes (reject inference, holdout groups, etc.).

KNNLabelImputer

KNNLabelImputer(
    n_neighbors: int = 10,
    metric: str = "minkowski",
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
)

Bases: BaseEstimator

KNN-based label imputation for missing-outcome records.

For each unlabeled record, finds the k nearest labeled neighbours in feature space. The distance-weighted average of the neighbours' labels gives P(event). transform() then converts these probabilities into training rows via TargetImputer.
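The distance-weighted average described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the helper name knn_event_probability, the inverse-distance weighting (1 / distance), and the eps guard are assumptions, and the sketch ignores any sample weights on the labeled records. It uses Euclidean distance, which is what scikit-learn's "minkowski" metric reduces to with its default p=2.

```python
import math


def knn_event_probability(labeled_X, labeled_y, query, n_neighbors=10, eps=1e-12):
    """Inverse-distance-weighted mean of the labels of the k nearest labeled rows.

    Hypothetical helper for illustration; the weighting scheme is an assumption.
    """
    # Euclidean distance from the query record to every labeled record.
    dists = [math.dist(row, query) for row in labeled_X]
    # Indices of the k closest labeled records (fewer if the pool is smaller).
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:n_neighbors]
    # Assumed weighting: 1 / distance; eps avoids division by zero on exact matches.
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    weighted_events = sum(w * labeled_y[i] for w, i in zip(weights, nearest))
    return weighted_events / sum(weights)


# Toy data: four labeled records, one unlabeled query point.
labeled_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labeled_y = [1, 1, 0, 0]
p_event = knn_event_probability(labeled_X, labeled_y, query=(0.5, 0.0), n_neighbors=3)
```

In this toy case the two closest neighbours carry label 1 and the third carries label 0, so the estimate lands well above 0.5 because the closer records dominate the average.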

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_neighbors | int | Number of nearest neighbours. | 10 |
| metric | str | Distance metric (any sklearn-compatible string). | 'minkowski' |
| method | str | Passed to TargetImputer; controls how probabilities become rows. | 'weighted' |
| cutoff | float | Threshold for method="cutoff". | 0.5 |
| seed | int | Random seed for method="randomized". | 42 |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| nn_ |  | Fitted NearestNeighbors instance. |
| y_ |  | Labeled target array. |
| weights_ |  | Sample weights for labeled records. |

Source code in datasci_toolkit/label_imputation.py
def __init__(
    self,
    n_neighbors: int = 10,
    metric: str = "minkowski",
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
) -> None:
    self.n_neighbors = n_neighbors
    self.metric = metric
    self.method = method
    self.cutoff = cutoff
    self.seed = seed

TargetImputer

TargetImputer(
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
)

Bases: BaseEstimator

Converts predicted probabilities into training rows.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| method | str | "weighted" duplicates each row into two rows, (target=1, w=p) and (target=0, w=1-p); "randomized" draws the target as a Bernoulli(p) sample; "cutoff" applies a hard threshold. | 'weighted' |
| cutoff | float | Threshold used when method="cutoff". | 0.5 |
| seed | int | Random seed for method="randomized". | 42 |
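The three method values can be illustrated with a small standalone function. This is a sketch under stated assumptions rather than TargetImputer's actual code: the function name probabilities_to_rows, the (row_index, target, weight) output shape, the unit weight for the "randomized" and "cutoff" methods, and the choice of p >= cutoff (rather than p > cutoff) are all hypothetical.

```python
import random


def probabilities_to_rows(proba, method="weighted", cutoff=0.5, seed=42):
    """Turn event probabilities into (row_index, target, weight) tuples.

    Hypothetical helper for illustration; output shape and tie handling are assumptions.
    """
    rng = random.Random(seed)  # only consumed by method="randomized"
    rows = []
    for i, p in enumerate(proba):
        if method == "weighted":
            # Duplicate the record: one row per outcome, weighted by its probability.
            rows.append((i, 1, p))
            rows.append((i, 0, 1.0 - p))
        elif method == "randomized":
            # Bernoulli draw: target is 1 with probability p.
            rows.append((i, int(rng.random() < p), 1.0))
        elif method == "cutoff":
            # Hard threshold (assumed inclusive at the cutoff).
            rows.append((i, int(p >= cutoff), 1.0))
        else:
            raise ValueError(f"unknown method: {method!r}")
    return rows


proba = [0.8, 0.3]
weighted_rows = probabilities_to_rows(proba, method="weighted")
cutoff_rows = probabilities_to_rows(proba, method="cutoff", cutoff=0.5)
random_rows = probabilities_to_rows(proba, method="randomized", seed=42)
```

With "weighted", a downstream model trained on a sample-weighted log loss sees, for each record, the loss -p log(q) - (1-p) log(1-q), which is the cross-entropy against the soft label p. The other two methods keep one row per record and instead commit to a single hard label.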

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| proba_ |  | Probability array stored after fit. |
| weights_ |  | Sample weight array stored after fit. |

Source code in datasci_toolkit/label_imputation.py
def __init__(
    self,
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
) -> None:
    self.method = method
    self.cutoff = cutoff
    self.seed = seed