Hot Deck Imputation

class datafusionsm.implicit_model.HotDeck(match_method='nearest', score_method='cosine', minimize=True, importance=None)

Fuse two data sources using statistical matching (hot-deck imputation), based on record similarity/distance and on information implicit to each source.

Parameters

match_method (str (default='nearest')) – Algorithm used to match records from each data source. One of 'nearest', 'neighbors', 'hungarian', or 'jonker_volgenant'.
score_method (str, callable, optional (default='cosine')) – Similarity/distance measure used to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics.
minimize (boolean (default=True)) – Whether to minimize the distance between records (True) or maximize the score/similarity (False).
importance (str (default=None)) –
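
As a rough illustration of how a score function and the minimize flag interact in nearest matching, the sketch below picks one donor per recipient in plain Python. It is not the library's implementation: nearest_match, the manhattan helper, the dict-of-tuples layout, and the (recipient, donor) ordering of each pair are all assumptions made for this example.

```python
def manhattan(a, b):
    """City-block distance between two equal-length numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_match(donors, recipients, score=manhattan, minimize=True):
    """Pick, for each recipient id, the donor id with the best score.

    Hypothetical helper: donors and recipients map ids to feature tuples.
    With minimize=True the score is treated as a distance (smaller is
    better); with minimize=False it is treated as a similarity.
    """
    pick = min if minimize else max
    matches = []
    for r_id, r_vec in recipients.items():
        best = pick(donors, key=lambda d_id: score(donors[d_id], r_vec))
        matches.append((r_id, best))
    return matches


# Toy records: (age, gender code). Values are made up for illustration.
donors = {"d1": (25, 1), "d2": (40, 0), "d3": (60, 1)}
recipients = {"r1": (27, 1), "r2": (58, 0)}

matches = nearest_match(donors, recipients)
```

Passing minimize=False with a distance such as this one would select the farthest donor instead, which is why that flag should agree with whether score_method is a distance or a similarity.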

critical

Critical cells within which records are matched; records must match exactly on these features.

Type: array-like[str]
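
One way to picture critical cells is as a partition of both sources by the critical features, so that a recipient is only ever compared with donors in the same cell. The sketch below shows that idea in plain Python; split_into_cells, the record layout, and the example values are hypothetical and are not taken from the library.

```python
from collections import defaultdict


def split_into_cells(records, critical):
    """Group record ids by their values on the critical features."""
    cells = defaultdict(list)
    for rec_id, rec in records.items():
        key = tuple(rec[c] for c in critical)
        cells[key].append(rec_id)
    return dict(cells)


# Made-up records for illustration only.
donors = {
    "d1": {"age": "18-34", "gender": "F"},
    "d2": {"age": "18-34", "gender": "M"},
    "d3": {"age": "35-54", "gender": "F"},
    "d4": {"age": "18-34", "gender": "F"},
}
recipients = {
    "r1": {"age": "18-34", "gender": "F"},
    "r2": {"age": "35-54", "gender": "F"},
    "r3": {"age": "55+", "gender": "M"},
}

critical = ["age", "gender"]
donor_cells = split_into_cells(donors, critical)
recipient_cells = split_into_cells(recipients, critical)

# Each recipient may only be matched to donors from its own cell.
candidates = {
    r_id: donor_cells.get(cell, [])
    for cell, r_ids in recipient_cells.items()
    for r_id in r_ids
}
```

In this toy data r3 has no candidate donors, which shows how overly fine critical cells can leave recipients unmatched.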

matches

Matched record pairs, given as record indices or id column values.

Type: array-like[tuple[str, str]]

usage

Number of times each donor was used in a match.

Type: Counter
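
A donor usage tally of this kind can be derived from a list of matched pairs with collections.Counter. The snippet below is only a sketch: the pairs are made up, and the (recipient, donor) ordering within each tuple is an assumption.

```python
from collections import Counter

# Hypothetical matched pairs, assumed here to be (recipient_id, donor_id).
matches = [("r1", "d1"), ("r2", "d3"), ("r3", "d1")]

# Count how many recipients each donor was assigned to.
usage = Counter(donor_id for _, donor_id in matches)

# Donors ordered from most to least used.
most_used = usage.most_common()
```

A few donors with very high counts can indicate that donated values are being over-concentrated on a small part of the donor pool.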

imp_wgts

Weights defining the impact of each feature when comparing records.

Type: array-like[float]
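
To see how per-feature weights can change which donor is closest, the sketch below applies weights inside a Manhattan distance. This is one plausible reading of importance weights, not the library's documented formula; weighted_manhattan and the example values are hypothetical.

```python
def weighted_manhattan(a, b, weights):
    """City-block distance with one multiplicative weight per feature."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def closest(recipient, donors, weights):
    """Return the donor id with the smallest weighted distance."""
    return min(
        donors,
        key=lambda d_id: weighted_manhattan(donors[d_id], recipient, weights),
    )


# Made-up records: (age, household size).
recipient = (30, 2)
donors = {"d1": (32, 5), "d2": (38, 2)}

# Equal weights: the age gap dominates, so d1 is closer.
equal = closest(recipient, donors, weights=(1.0, 1.0))

# Down-weighting age makes household size decisive, so d2 is closer.
reweighted = closest(recipient, donors, weights=(0.1, 1.0))
```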

Examples

>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import HotDeck
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> hd = HotDeck(
...     match_method="neighbors", score_method="manhattan"
... ).fit(panel, survey, critical="age,gender")
>>> fused = hd.transform(panel, survey, target="income")

fit(donors, recipients, linking=None, critical=None, imp_wgts=None, target=None, match_args=None, score_args=None, donor_id_col=0, recipient_id_col=0)

Fuse two data sources by matching records

Parameters

donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.
critical (array-like, optional (default=None)) – Features that must match exactly when fusing
imp_wgts (array-like, dict, optional (default=None)) – Importance weights for linking variables
target (string, array-like, optional (default=None)) – What information will be donated. When given here, serves as the reference column if calculating importance weights. If None, importance weights will be by individual columns.
match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.
score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, see sklearn.metrics.
donor_id_col (int (default=0)) – Index of column serving as donor record index
recipient_id_col (int (default=0)) – Index of column serving as recipient record index

Returns

self

Return type

object

Notes

The data contained in donors and recipients is assumed to have at least a few overlapping features with common values. They should also contain an id column appropriately titled.
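
When linking is None, the linking variables are described as all overlapping columns. The sketch below shows one way such an overlap could be computed from two lists of column names. It is an illustration only: overlapping_columns and its exclude argument are hypothetical, and the column names are invented.

```python
def overlapping_columns(donor_cols, recipient_cols, exclude=()):
    """Return columns present in both sources, in donor order.

    `exclude` is a hypothetical convenience for dropping id or target
    columns; whether HotDeck does this itself is not stated in the docs.
    """
    shared = set(recipient_cols) - set(exclude)
    return [c for c in donor_cols if c in shared]


# Invented column names for illustration.
donor_cols = ["panel_id", "age", "gender", "region", "income"]
recipient_cols = ["survey_id", "age", "gender", "region", "visits"]

linking = overlapping_columns(donor_cols, recipient_cols)
```

Checking the overlap before calling fit is a cheap way to confirm that the two sources share enough features with common values to be matched at all.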

transform(donors, recipients, target=None)

Using fused ids, impute information from donor data to the recipient data.

Parameters

donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
target (string, array-like, optional (default=None)) – What information will be shared from donor data. If None, data will be joined on fused indices and all overlapping fields from donors will have a _d suffix.

Returns

ret – New DataFrame containing donated information

Return type

pandas.DataFrame
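
Conceptually, the imputation step copies the target value of each matched donor onto its recipient. The sketch below does this with plain dictionaries rather than DataFrames; impute, the record layout, and the (recipient, donor) pair ordering are assumptions for illustration and do not reproduce the library's join logic.

```python
def impute(donors, recipients, matches, target):
    """Copy `target` from each matched donor onto a copy of its recipient."""
    fused = {}
    for r_id, d_id in matches:
        row = dict(recipients[r_id])       # leave the input untouched
        row[target] = donors[d_id][target]
        fused[r_id] = row
    return fused


# Made-up data for illustration.
donors = {
    "d1": {"age": 25, "income": 40000},
    "d3": {"age": 60, "income": 55000},
}
recipients = {
    "r1": {"age": 27},
    "r2": {"age": 58},
}
matches = [("r1", "d1"), ("r2", "d3")]   # assumed (recipient_id, donor_id)

fused = impute(donors, recipients, matches, target="income")
```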
