variable_clustering

Hierarchical correlation clustering for variable reduction before logistic regression.

CorrVarClus

CorrVarClus(
    max_correlation: float = 0.5,
    max_clusters: int | None = None,
    sample: int = 0,
)

Bases: BaseEstimator

Hierarchical correlation clustering for variable reduction.

Groups features into clusters using average-linkage hierarchical clustering with a correlation distance metric. Ranks features within each cluster by absolute Gini so the most predictive representative can be selected.
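
The grouping step described above can be sketched with SciPy as follows. This is a minimal illustration, not the library's actual implementation: the helper name `cluster_by_correlation`, the distance definition `1 - |corr|`, and the mapping of `max_correlation` to a cut height of `1 - max_correlation` are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_by_correlation(X: pd.DataFrame, max_correlation: float = 0.5) -> pd.Series:
    """Hypothetical helper: 1-indexed cluster label per non-constant column."""
    # Drop zero-variance columns, whose correlation is undefined.
    X = X.loc[:, X.std() > 0]
    # Assumed distance: 1 - |corr|, so strongly correlated features are close.
    dist = 1.0 - X.corr().abs().to_numpy()
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0  # remove floating-point asymmetry
    # Average-linkage clustering on the condensed distance vector.
    Z = linkage(squareform(dist, checks=False), method="average")
    # Assumed cut: dendrogram height 1 - max_correlation.
    labels = fcluster(Z, t=1.0 - max_correlation, criterion="distance")
    return pd.Series(labels, index=X.columns, name="cluster")


# Synthetic data: "a_noisy" is nearly a copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = pd.DataFrame({
    "a": base,
    "a_noisy": base + 0.1 * rng.normal(size=500),
    "b": rng.normal(size=500),
    "const": np.ones(500),
})
labels = cluster_by_correlation(X)
```

In this sketch `a` and `a_noisy` share a cluster, `b` gets its own, and the constant column is dropped before clustering.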

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_correlation | float | Dendrogram cut height. Features correlated above this threshold end up in the same cluster. | 0.5 |
| max_clusters | int \| None | Hard cap on number of clusters. Overrides max_correlation when set. | None |
| sample | int | Subsample rows before clustering for speed on large datasets. 0 uses all rows. | 0 |
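
To illustrate how `max_clusters` could take precedence over `max_correlation`, here is a small sketch built on `scipy.cluster.hierarchy.fcluster`. The helper `cut_tree` is hypothetical, and the use of `criterion="maxclust"` for the cap and a cut height of `1 - max_correlation` for the threshold are assumptions about behavior the documentation only describes in words.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_tree(Z, max_correlation=0.5, max_clusters=None):
    """Hypothetical helper: flat cluster labels from a linkage matrix."""
    if max_clusters is not None:
        # The cap wins: request at most `max_clusters` flat clusters.
        return fcluster(Z, t=max_clusters, criterion="maxclust")
    # Otherwise cut at the (assumed) distance 1 - max_correlation.
    return fcluster(Z, t=1.0 - max_correlation, criterion="distance")


# Condensed distances for 4 features, in the order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
# Features 0/1 and 2/3 form two tight pairs that are far from each other.
condensed = np.array([0.1, 0.9, 0.9, 0.9, 0.9, 0.2])
Z = linkage(condensed, method="average")

by_threshold = cut_tree(Z, max_correlation=0.5)
by_cap = cut_tree(Z, max_correlation=0.5, max_clusters=1)
strict = cut_tree(Z, max_correlation=0.95)
```

With these toy distances the default threshold yields the two pairs, the cap of 1 merges everything regardless of the threshold, and a very high threshold leaves every feature in its own cluster.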

Attributes:

| Name | Description |
| --- | --- |
| features_ | Column names after dropping zero-variance columns. |
| labels_ | Cluster label per feature (1-indexed). |
| Z_ | Linkage matrix from scipy.cluster.hierarchy.linkage. |
| corr_line_ | The correlation threshold used to cut the dendrogram. |
| cluster_table_ | DataFrame with columns feature, cluster, gini, rank (1 = best in cluster). |
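
A table shaped like `cluster_table_` could be assembled roughly as below. This is a sketch under assumptions: the helpers `gini` and `build_cluster_table` are hypothetical, Gini is taken to mean `2 * AUC - 1` against a binary target, and ties in absolute Gini are broken by column order. The library may compute these differently.

```python
import pandas as pd


def gini(feature: pd.Series, target: pd.Series) -> float:
    """Gini = 2 * AUC - 1, with AUC from the Mann-Whitney rank sum."""
    ranks = feature.rank(method="average")
    n_pos = int((target == 1).sum())
    n_neg = len(target) - n_pos
    auc = (ranks[target == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2.0 * auc - 1.0


def build_cluster_table(X: pd.DataFrame, y: pd.Series, labels: dict) -> pd.DataFrame:
    """Hypothetical helper: one row per feature, ranked within its cluster."""
    table = pd.DataFrame({
        "feature": list(X.columns),
        "cluster": [labels[c] for c in X.columns],
        "gini": [gini(X[c], y) for c in X.columns],
    })
    # Rank 1 = largest absolute Gini within the cluster.
    table["rank"] = (
        table["gini"].abs()
        .groupby(table["cluster"])
        .rank(method="first", ascending=False)
        .astype(int)
    )
    return table.sort_values(["cluster", "rank"]).reset_index(drop=True)


y = pd.Series([0, 0, 0, 1, 1, 1])
X = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],   # separates the classes perfectly
    "weak": [1, 4, 2, 5, 3, 6],     # partially separates the classes
    "inverse": [6, 5, 4, 3, 2, 1],  # perfectly separates, negative direction
})
labels = {"strong": 1, "weak": 1, "inverse": 2}  # illustrative cluster labels
table = build_cluster_table(X, y, labels)
```

Selecting the rows where `rank == 1` then gives one representative per cluster. Ranking on the absolute value means a feature with a strongly negative Gini, such as `inverse` here, still counts as highly predictive.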

Source code in datasci_toolkit/variable_clustering.py
def __init__(
    self,
    max_correlation: float = 0.5,
    max_clusters: int | None = None,
    sample: int = 0,
) -> None:
    self.max_correlation = max_correlation
    self.max_clusters = max_clusters
    self.sample = sample