Tagging

Weighted TF-IDF with per-entity Z-score normalization for entity tagging.

WeightedTFIDF

Assigns ranked tags to entities using a weighted TF-IDF score, then normalizes per entity via Z-scores and min-max scaling.

When to use

Product attribute extraction: millions of products, each with hundreds of attribute mentions from reviews and descriptions. You want the top 5-10 tags that genuinely characterize each product. The weight column carries review helpfulness or NLP confidence, so a high-confidence mention of "waterproof" counts more than a passing reference.
Customer interest profiling: a bank wants to know what each customer "is about" -- mortgage customer, travel spender, investor. Tags are merchant categories, value is transaction count, weight is transaction amount. A customer with 2 large investment transfers is more "investor" than one with 50 small coffee purchases.
Support ticket routing: keywords extracted from tickets, weighted by extraction confidence. Subject-line keywords get a higher level (hierarchy multiplier) than body-text keywords. Because Z-scores are computed per ticket, a 3-word ticket and a 500-word ticket both route on their most distinctive terms.
Ad targeting / audience segmentation: users tagged by browsed/purchased product categories, weighted by dwell time or conversion signal. You want the top 3-5 interest tags per user, comparable across power users and casual visitors.
Any (entity, tag, count) problem where you have an external quality signal and need the most distinctive tags per entity -- not the most frequent, not the globally rarest, but the ones that are unusually strong for that specific entity relative to its own distribution.

Basic usage (standard TF-IDF)

import polars as pl
from datasci_toolkit import WeightedTFIDF

df = pl.DataFrame({
    "doc_id": ["A", "A", "A", "B", "B", "B"],
    "term": ["python", "data", "ml", "python", "web", "api"],
    "count": [10, 8, 5, 12, 7, 3],
})

tfidf = WeightedTFIDF(score_threshold=0.1)
result = tfidf.fit_transform(df, entity_col="doc_id", tag_col="term", value_col="count")
print(result)

With external weights and hierarchy

import polars as pl
from datasci_toolkit import WeightedTFIDF

df = pl.DataFrame({
    "product": ["P1", "P1", "P1", "P2", "P2"],
    "attribute": ["durable", "lightweight", "cheap", "durable", "premium"],
    "mentions": [10, 5, 20, 8, 3],
    "confidence": [0.9, 0.7, 0.3, 0.8, 0.9],
    "tier": [1.0, 1.0, 0.5, 1.0, 1.0],
})

tfidf = WeightedTFIDF(weight_col="confidence", level_col="tier")
result = tfidf.fit_transform(
    df, entity_col="product", tag_col="attribute", value_col="mentions"
)
print(result)

The confidence column weights each mention by its reliability. The tier column is a hierarchy multiplier that boosts primary attributes (1.0 here) over secondary ones (0.5 here).
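To see what the weighting does to the example data, the arithmetic can be reproduced by hand. This is a minimal sketch in plain Python, not the library's code. It assumes the weighted term frequency is sum(weight * value) divided by the entity's own weighted total, and it leaves the tier multiplier out because that is applied later, at the scoring step.

```python
from collections import defaultdict

# Rows from the example above: (product, attribute, mentions, confidence, tier)
rows = [
    ("P1", "durable", 10, 0.9, 1.0),
    ("P1", "lightweight", 5, 0.7, 1.0),
    ("P1", "cheap", 20, 0.3, 0.5),
    ("P2", "durable", 8, 0.8, 1.0),
    ("P2", "premium", 3, 0.9, 1.0),
]

weighted = defaultdict(float)  # confidence-weighted mentions per (product, attribute)
totals = defaultdict(float)    # assumed entity_total: weighted mentions per product
for product, attribute, mentions, confidence, _tier in rows:
    contribution = confidence * mentions
    weighted[(product, attribute)] += contribution
    totals[product] += contribution

# Weighted TF: each attribute's share of its own product's weighted total.
tf = {key: w / totals[key[0]] for key, w in weighted.items()}

for (product, attribute), share in sorted(tf.items()):
    print(product, attribute, round(share, 3))
```

For P1, "cheap" has the most raw mentions (20), but at confidence 0.3 it contributes 6 weighted mentions, fewer than the 9 contributed by "durable" (10 mentions at confidence 0.9).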

Parameters

Parameter	Default	Description
`zscore_thresh`	`2.0`	Tags whose per-entity Z-score exceeds this are treated as dominant and assigned final_score=1.0.
`score_threshold`	`0.1`	Minimum final score to retain a tag.
`weight_col`	`None`	Column with external relevance signal. If None, every row uses weight 1.0.
`level_col`	`None`	Column with hierarchy multiplier. If None, every row uses level 1.0.

How it works

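Weighted TF: sum(weight * value) / entity_total, normalized within each entity.
IDF: |log10(N / (1 + entity_count))|, where N is the number of entities and entity_count is the number of entities that carry the tag. Penalizes globally common tags.
Score: level * TF * IDF.
Z-score: each score is standardized against the mean and standard deviation of its own entity's scores. Single-tag entities have no spread to standardize against, so they get Z=3.0 (dominant under the default zscore_thresh of 2.0).
Dominant tags (Z > zscore_thresh): assigned final_score=1.0.
Normal tags: min-max scaled to [0, 1] within the entity, then dropped if the result falls below score_threshold.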
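
The steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the implementation inside WeightedTFIDF. The function name weighted_tfidf_sketch is hypothetical. The sketch assumes entity_total is the entity's sum of weight * value, N is the number of distinct entities, entity_count is the number of entities that carry the tag, Z-scores use the population standard deviation, and min-max scaling runs over all of an entity's raw scores. The library may differ on any of these points.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def weighted_tfidf_sketch(rows, zscore_thresh=2.0, score_threshold=0.1):
    """rows: iterable of (entity, tag, value, weight, level) tuples."""
    # Weighted counts per (entity, tag); if a pair repeats, the last level wins.
    weighted = defaultdict(float)
    level_of = {}
    for entity, tag, value, weight, level in rows:
        weighted[(entity, tag)] += weight * value
        level_of[(entity, tag)] = level

    # Assumed entity_total: sum of weighted counts within each entity.
    totals = defaultdict(float)
    for (entity, _tag), w in weighted.items():
        totals[entity] += w

    # Assumed N and entity_count: distinct entities overall and per tag.
    n_entities = len(totals)
    entity_count = defaultdict(int)
    for (_entity, tag) in weighted:
        entity_count[tag] += 1

    # Raw score = level * TF * IDF, grouped by entity.
    raw = defaultdict(dict)
    for (entity, tag), w in weighted.items():
        tf = w / totals[entity] if totals[entity] > 0 else 0.0
        idf = abs(math.log10(n_entities / (1 + entity_count[tag])))
        raw[entity][tag] = level_of[(entity, tag)] * tf * idf

    final = {}
    for entity, scores in raw.items():
        values = list(scores.values())
        if len(values) == 1:
            # Single-tag entity: fixed Z of 3.0, as described above.
            z = {tag: 3.0 for tag in scores}
        else:
            mu, sigma = mean(values), pstdev(values)
            z = {t: (s - mu) / sigma if sigma > 0 else 0.0 for t, s in scores.items()}

        lo, span = min(values), max(values) - min(values)
        kept = {}
        for tag, s in scores.items():
            if z[tag] > zscore_thresh:
                kept[tag] = 1.0  # dominant tag
            else:
                scaled = (s - lo) / span if span > 0 else 0.0
                if scaled >= score_threshold:
                    kept[tag] = scaled
        final[entity] = kept
    return final


rows = [
    ("A", "x", 10, 1.0, 1.0),
    ("A", "y", 5, 1.0, 1.0),
    ("A", "z", 1, 1.0, 1.0),
    ("B", "x", 4, 1.0, 1.0),
    ("C", "q", 3, 1.0, 1.0),
    ("C", "y", 3, 1.0, 1.0),
    ("D", "w", 2, 1.0, 1.0),
]
result = weighted_tfidf_sketch(rows)
print(result["A"])
print(result["B"])
```

In this toy input, entity B has a single tag, so it takes the fixed Z of 3.0 and is kept with a final score of 1.0. Entity A's lowest-scoring tag scales to 0.0 and is dropped by the default score_threshold.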