variable_clustering

Hierarchical correlation clustering for variable reduction before logistic regression.

CorrVarClus

CorrVarClus(
    max_correlation: float = 0.5,
    max_clusters: int | None = None,
    sample: int = 0,
)

Bases: BaseEstimator

Hierarchical correlation clustering for variable reduction.

Groups features into clusters using average-linkage hierarchical clustering with a correlation distance metric. Ranks features within each cluster by absolute Gini so the most predictive representative can be selected.
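
The grouping step described above can be sketched with SciPy alone. This is a minimal illustration, not the library's implementation: it assumes the correlation distance is `1 - |corr|` and that a `max_correlation` of `0.5` maps to a cut at distance `0.5`. The toy data and variable names are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy data: two pairs of strongly correlated features (illustrative only).
rng = np.random.default_rng(0)
n = 500
base_a = rng.normal(size=n)
base_b = rng.normal(size=n)
X = np.column_stack([
    base_a,
    base_a + 0.1 * rng.normal(size=n),
    base_b,
    base_b + 0.1 * rng.normal(size=n),
])

# Assumption: distance between two features is 1 - |Pearson correlation|.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)  # remove floating-point noise on the diagonal

# Average-linkage clustering on the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")

# Assumption: cutting at distance 1 - max_correlation, with max_correlation = 0.5.
labels = fcluster(Z, t=1.0 - 0.5, criterion="distance")
```

With this data, the two noisy copies of each base signal should land in the same cluster, and `fcluster` returns 1-indexed labels, which matches the description of `labels_`.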

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `max_correlation` | `float` | Dendrogram cut height. Features correlated above this threshold end up in the same cluster. | `0.5` |
| `max_clusters` | `int \| None` | Hard cap on the number of clusters. Overrides `max_correlation` when set. | `None` |
| `sample` | `int` | Subsample rows before clustering for speed on large datasets. `0` uses all rows. | `0` |
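
To make the interaction between the three parameters concrete, here is a hedged sketch. The helpers `subsample_rows` and `cut_labels` are hypothetical names invented for this example. The mapping of `max_correlation` to a distance cut of `1 - max_correlation`, and of `max_clusters` to SciPy's `maxclust` criterion, is an assumption about how such a cap could be implemented, not a statement about the library's internals.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def subsample_rows(X, sample, seed=0):
    """Hypothetical helper: keep `sample` random rows; 0 keeps all rows."""
    if sample <= 0 or sample >= X.shape[0]:
        return X
    idx = np.random.default_rng(seed).choice(X.shape[0], size=sample, replace=False)
    return X[idx]


def cut_labels(Z, max_correlation=0.5, max_clusters=None):
    """Hypothetical helper: max_clusters, when set, takes precedence."""
    if max_clusters is not None:
        return fcluster(Z, t=max_clusters, criterion="maxclust")
    # Assumption: the correlation threshold maps to distance 1 - max_correlation.
    return fcluster(Z, t=1.0 - max_correlation, criterion="distance")


# Toy data: three independent signals, each duplicated with noise.
rng = np.random.default_rng(1)
bases = rng.normal(size=(2000, 3))
X = np.column_stack([bases[:, i // 2] + 0.2 * rng.normal(size=2000) for i in range(6)])

X_small = subsample_rows(X, sample=500)
dist = 1.0 - np.abs(np.corrcoef(X_small, rowvar=False))
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

labels_by_corr = cut_labels(Z, max_correlation=0.5)
labels_capped = cut_labels(Z, max_correlation=0.5, max_clusters=2)
```

Here the correlation cut should recover the three underlying groups, while the capped call returns at most two clusters regardless of the threshold.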

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `features_` | | Column names after dropping zero-variance columns. |
| `labels_` | | Cluster label per feature (1-indexed). |
| `Z_` | | Linkage matrix from `scipy.cluster.hierarchy.linkage`. |
| `corr_line_` | | The correlation threshold used to cut the dendrogram. |
| `cluster_table_` | | DataFrame with columns `feature`, `cluster`, `gini`, `rank` (1 = best in cluster). |
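
The shape of `cluster_table_` can be illustrated with a small sketch. It assumes that Gini is computed per feature against a binary target as `|2 * AUC - 1|`; the source does not state the exact formula, so treat `abs_gini`, the toy features, and the hard-coded cluster labels as hypothetical.

```python
import numpy as np
import pandas as pd


def abs_gini(x, y):
    """Absolute Gini as |2 * AUC - 1|, with AUC from the rank-sum statistic."""
    ranks = pd.Series(x).rank(method="average").to_numpy()
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return abs(2.0 * auc - 1.0)


# Toy binary target and three features (illustrative only).
rng = np.random.default_rng(2)
n = 1000
y = rng.integers(0, 2, size=n)
features = {
    "strong": y + 0.5 * rng.normal(size=n),
    "weak": y + 3.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
}
labels = [1, 1, 2]  # assumed cluster labels, 1-indexed

table = pd.DataFrame({
    "feature": list(features),
    "cluster": labels,
    "gini": [abs_gini(x, y) for x in features.values()],
})
# Rank within each cluster; 1 marks the highest absolute Gini.
table["rank"] = (
    table.groupby("cluster")["gini"].rank(method="first", ascending=False).astype(int)
)
table = table.sort_values(["cluster", "rank"]).reset_index(drop=True)

# One representative per cluster: the rank-1 feature.
representatives = table.loc[table["rank"] == 1, "feature"].tolist()
```

Filtering on `rank == 1` is one simple way to pick a single representative per cluster before fitting the logistic regression.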

Source code in `datasci_toolkit/variable_clustering.py`

def __init__(
    self,
    max_correlation: float = 0.5,
    max_clusters: int | None = None,
    sample: int = 0,
) -> None:
    self.max_correlation = max_correlation
    self.max_clusters = max_clusters
    self.sample = sample
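
Because the constructor only stores its arguments unchanged, the class follows the scikit-learn estimator convention, so `get_params`, `set_params`, and `clone` from `BaseEstimator` work without extra code. The sketch below uses a stand-in class, `CorrVarClusSketch`, which is hypothetical and copies only the `__init__` shown above. It does not import or reproduce the real `CorrVarClus`.

```python
from __future__ import annotations

from sklearn.base import BaseEstimator, clone


class CorrVarClusSketch(BaseEstimator):
    """Hypothetical stand-in that mirrors only the constructor shown above."""

    def __init__(
        self,
        max_correlation: float = 0.5,
        max_clusters: int | None = None,
        sample: int = 0,
    ) -> None:
        # Store arguments as-is; validation is left to a later fitting step.
        self.max_correlation = max_correlation
        self.max_clusters = max_clusters
        self.sample = sample


est = CorrVarClusSketch(max_correlation=0.7, sample=10_000)
params = est.get_params()       # read back the constructor arguments
copy = clone(est)               # new unfitted estimator with the same parameters
est.set_params(max_clusters=5)  # update a parameter in place
```

This is also what allows such an estimator to be used in utilities that rely on parameter introspection, such as grid searches.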