label_imputation

Imputation strategies for missing labels, useful when part of the population has unknown outcomes (for example, reject inference or holdout groups).

KNNLabelImputer

KNNLabelImputer(
    n_neighbors: int = 10,
    metric: str = "minkowski",
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
)

Bases: BaseEstimator

KNN-based label imputation for missing-outcome records.

For each unlabeled record, finds the k nearest labeled neighbours in feature space. The distance-weighted average of those neighbours' labels gives the estimated P(event). `transform()` then converts these probabilities into training rows via `TargetImputer`.
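To make the distance-weighted average concrete, here is a minimal dependency-free sketch. It is not the class's actual implementation: the helper name `knn_event_probability`, the inverse-distance weighting, the Euclidean metric, and the `eps` guard are all assumptions chosen for illustration, and per-record sample weights are ignored.

```python
import math


def knn_event_probability(labeled_X, labeled_y, query, n_neighbors=3, eps=1e-12):
    """Hypothetical helper: distance-weighted mean label of the k nearest labeled points."""
    # Euclidean distance from the query to every labeled record.
    distances = [math.dist(query, x) for x in labeled_X]
    # Indices of the k closest labeled records.
    nearest = sorted(range(len(labeled_X)), key=distances.__getitem__)[:n_neighbors]
    # Inverse-distance weights (an assumed scheme); eps avoids division by zero
    # when the query coincides with a labeled record.
    weights = [1.0 / (distances[i] + eps) for i in nearest]
    # Weighted average of the binary labels is the estimated P(event).
    return sum(w * labeled_y[i] for w, i in zip(weights, nearest)) / sum(weights)


# Toy data: three labeled records near the query and one far away.
labeled_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labeled_y = [1, 1, 0, 0]
p_event = knn_event_probability(labeled_X, labeled_y, query=(0.5, 0.0), n_neighbors=3)
```

In this toy case the two closest neighbours carry label 1 and the third, more distant one carries label 0, so the estimate lands well above 0.5 but below 1.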

Parameters:

Name	Type	Description	Default
`n_neighbors`	`int`	Number of nearest neighbours.	`10`
`metric`	`str`	Distance metric (any sklearn-compatible string).	`'minkowski'`
`method`	`str`	Passed to `TargetImputer`; controls how probabilities become rows (`"weighted"`, `"randomized"`, or `"cutoff"`).	`'weighted'`
`cutoff`	`float`	Threshold for `method="cutoff"`.	`0.5`
`seed`	`int`	Random seed for `method="randomized"`.	`42`

Attributes:

Name	Type	Description
`nn_`		Fitted `NearestNeighbors` instance.
`y_`		Labeled target array.
`weights_`		Sample weights for labeled records.

Source code in datasci_toolkit/label_imputation.py

def __init__(
    self,
    n_neighbors: int = 10,
    metric: str = "minkowski",
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
) -> None:
    self.n_neighbors = n_neighbors
    self.metric = metric
    self.method = method
    self.cutoff = cutoff
    self.seed = seed

TargetImputer

TargetImputer(
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
)

Bases: BaseEstimator

Converts predicted probabilities into training rows.

Parameters:

Name	Type	Description	Default
`method`	`str`	`"weighted"` emits each row twice, as `(target=1, w=p)` and `(target=0, w=1-p)`; `"randomized"` draws the target from a Bernoulli sample with probability `p`; `"cutoff"` applies a hard threshold at `cutoff`.	`'weighted'`
`cutoff`	`float`	Threshold used when `method="cutoff"`.	`0.5`
`seed`	`int`	Random seed for `method="randomized"`.	`42`
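The three `method` options can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: the function name `probabilities_to_rows`, the `(row_index, target, weight)` tuple layout, the `>=` comparison for the cutoff, and the unit weights for the non-weighted methods are all hypothetical, and any existing sample weights are ignored.

```python
import random


def probabilities_to_rows(proba, method="weighted", cutoff=0.5, seed=42):
    """Hypothetical helper: turn probabilities into (row_index, target, weight) tuples."""
    rng = random.Random(seed)  # only consumed by method="randomized"
    rows = []
    for i, p in enumerate(proba):
        if method == "weighted":
            # Each record appears twice: once per outcome, weighted by its probability.
            rows.append((i, 1, p))
            rows.append((i, 0, 1.0 - p))
        elif method == "randomized":
            # One Bernoulli(p) draw per record; unit weight is an assumption.
            rows.append((i, int(rng.random() < p), 1.0))
        elif method == "cutoff":
            # Hard threshold; treating p == cutoff as an event is an assumption.
            rows.append((i, int(p >= cutoff), 1.0))
        else:
            raise ValueError(f"unknown method: {method!r}")
    return rows


proba = [0.8, 0.3]
weighted_rows = probabilities_to_rows(proba, method="weighted")
cutoff_rows = probabilities_to_rows(proba, method="cutoff", cutoff=0.5)
randomized_rows = probabilities_to_rows(proba, method="randomized", seed=42)
```

Note the trade-off: `"weighted"` doubles the row count but keeps the full probability information, while the other two keep one row per record at the cost of discarding or randomizing it.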

Attributes:

Name	Type	Description
`proba_`		Probability array stored after `fit`.
`weights_`		Sample weight array stored after `fit`.

Source code in datasci_toolkit/label_imputation.py

def __init__(
    self,
    method: str = "weighted",
    cutoff: float = 0.5,
    seed: int = 42,
) -> None:
    self.method = method
    self.cutoff = cutoff
    self.seed = seed