Hot Deck Imputation

class datafusionsm.implicit_model.HotDeck(match_method='nearest', score_method='cosine', minimize=True, importance=None)

Fuses two data sources using statistical matching (hot-deck imputation), based on record similarity/distance and the information implicit in each source.

Parameters
  • match_method (str (default='nearest')) – Algorithm used to match records from each data source: ‘nearest’, ‘neighbors’, ‘hungarian’, or ‘jonker_volgenant’.

  • score_method (str, callable, optional (default 'cosine')) – Similarity/distance measure used to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics.

  • minimize (boolean (default=True)) – If True, minimize the distance between records; if False, maximize the score/similarity.

  • importance (str (default=None)) –
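As an illustration of match_method, score_method, and minimize above, the following is a minimal sketch written with scipy only; it is not the library's implementation. The toy arrays are invented, and scipy's linear_sum_assignment is used here as a stand-in for an assignment-based (Hungarian-style) matcher. It contrasts independent nearest-donor matching, where a donor may be reused, with one-to-one assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Toy linking features (assumed data): rows are records, columns are shared variables.
recipients = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
donors = np.array([[1.0, 0.05], [0.1, 1.0], [0.5, 0.5]])

# Pairwise cosine distances, shape (n_recipients, n_donors).
dist = cdist(recipients, donors, metric="cosine")

# Nearest-style matching: each recipient independently takes its closest donor,
# so the same donor can be picked more than once.
nearest = dist.argmin(axis=1)

# The same choice expressed as maximizing a similarity instead of minimizing a distance.
nearest_by_sim = (1.0 - dist).argmax(axis=1)

# Assignment-style matching: a one-to-one pairing that minimizes the total distance.
rows, cols = linear_sum_assignment(dist)

print("nearest donor per recipient:", nearest.tolist())
print("one-to-one assignment:", list(zip(rows.tolist(), cols.tolist())))
```

In this toy case the first two recipients share a nearest donor, while the one-to-one assignment forces the second recipient onto a more distant donor.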

critical

Critical cells: features on which matched records must agree exactly. Matching is performed within each cell.

Type

array-like[str]

matches

Matched record pairs, given as record indices or id column values.

Type

array-like[tuple[str, str]]

usage

Number of times each donor was used in a match.

Type

Counter

imp_wgts

Weights defining the impact of each feature when comparing records.

Type

array-like[float]
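To make the attributes above concrete, here is a hedged sketch of how critical cells, importance weights, matched pairs, and donor usage relate to one another. It uses pandas only and does not call the library. The data, the column names, the weight values, and the (recipient, donor) ordering of each pair are all assumptions made for illustration.

```python
from collections import Counter

import pandas as pd

# Toy data (assumed): both sources share gender, age, and hours_tv.
donors = pd.DataFrame({
    "id": ["d1", "d2", "d3"],
    "gender": ["F", "M", "F"],
    "age": [30, 41, 52],
    "hours_tv": [10, 3, 7],
})
recipients = pd.DataFrame({
    "id": ["r1", "r2", "r3"],
    "gender": ["F", "F", "M"],
    "age": [33, 29, 60],
    "hours_tv": [9, 11, 2],
})

critical = ["gender"]                        # must match exactly
imp_wgts = {"age": 0.25, "hours_tv": 0.75}   # illustrative feature weights

# Joining on the critical columns keeps only candidate pairs from the same cell.
pairs = recipients.merge(donors, on=critical, suffixes=("_r", "_d"))

# Weighted Manhattan distance over the remaining linking features.
pairs["dist"] = sum(
    w * (pairs[f"{col}_r"] - pairs[f"{col}_d"]).abs() for col, w in imp_wgts.items()
)

# Keep the closest donor for each recipient.
best = pairs.loc[pairs.groupby("id_r")["dist"].idxmin()]

matches = list(zip(best["id_r"], best["id_d"]))  # (recipient id, donor id) pairs
usage = Counter(best["id_d"])                    # how often each donor was used

print(matches)
print(usage)
```

Donor d3 is never used here: it sits in the same critical cell as r1 and r2, but d1 is closer to both.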

Examples

>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import HotDeck
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> hd = HotDeck(
...     match_method="neighbors", score_method="manhattan"
... ).fit(panel, survey, critical="age,gender")
>>> fused = hd.transform(panel, survey, target="income")

fit(donors, recipients, linking=None, critical=None, imp_wgts=None, target=None, match_args=None, score_args=None, donor_id_col=0, recipient_id_col=0)

Fuse two data sources by matching records

Parameters
  • donors (pandas.DataFrame) – Records containing information to be donated

  • recipients (pandas.DataFrame) – Records that will receive information

  • linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.

  • critical (array-like, optional (default=None)) – Features that must match exactly when fusing

  • imp_wgts (array-like, dict, optional(default=None)) – Importance weights for linking variables

  • target (string, array-like, optional (default=None)) – What information will be donated. When given here, serves as the reference column if calculating importance weights. If None, importance weights will be calculated for individual columns.

  • match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.

  • score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, see sklearn.metrics.

  • donor_id_col (int (default=0)) – Index of column serving as donor record index

  • recipient_id_col (int (default=0)) – Index of column serving as recipient record index

Returns

self

Return type

object

Notes

The data in donors and recipients are assumed to share at least a few overlapping features with common values. Each should also contain an appropriately titled id column.
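The default for linking (all overlapping columns) can be pictured with the following sketch, which uses pandas only. The data frames and the id column name are invented, and excluding the id column from the linking set is an assumption of this sketch, not something the documentation confirms.

```python
import pandas as pd

# Toy data (assumed): "age" and "gender" appear in both sources.
donors = pd.DataFrame(
    {"id": [1, 2], "age": [30, 41], "gender": ["F", "M"], "income": [50, 62]}
)
recipients = pd.DataFrame(
    {"id": [7, 8], "age": [33, 40], "gender": ["F", "M"], "visits": [4, 9]}
)

id_col = "id"  # hypothetical id column name

# Overlapping columns, in donor column order, leaving out the id column.
linking = [c for c in donors.columns if c in recipients.columns and c != id_col]

# Columns found only in the donor data are the candidates for donation.
donor_only = [c for c in donors.columns if c not in recipients.columns]

print(linking)
print(donor_only)
```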

transform(donors, recipients, target=None)

Using fused ids, impute information from donor data to the recipient data.

Parameters
  • donors (pandas.DataFrame) – Records containing information to be donated

  • recipients (pandas.DataFrame) – Records that will receive information

  • target (string, array-like, optional (default=None)) – What information will be shared from donor data. If None, data will be joined on fused indices and all overlapping fields from donors will have _d suffix.

Returns

ret – New DataFrame containing donated information

Return type

pandas.DataFrame
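The imputation step described for transform can be sketched with plain pandas joins, assuming the matched pairs are available as (recipient id, donor id) tuples. This is an illustration under those assumptions, not the library's code; the data, the helper column id_d, and the pair ordering are invented for the example.

```python
import pandas as pd

# Toy data (assumed) and a list of matched (recipient id, donor id) pairs.
donors = pd.DataFrame({"id": ["d1", "d2"], "age": [30, 41], "income": [50, 62]})
recipients = pd.DataFrame({"id": ["r1", "r2", "r3"], "age": [33, 29, 60]})
matches = [("r1", "d1"), ("r2", "d1"), ("r3", "d2")]

# Lookup table from recipient id to matched donor id.
links = pd.DataFrame(matches, columns=["id", "id_d"])
linked = recipients.merge(links, on="id", how="left")

# With a target: bring over only the requested donor columns.
target = ["income"]
donor_target = donors[["id"] + target].rename(columns={"id": "id_d"})
fused = linked.merge(donor_target, on="id_d", how="left").drop(columns="id_d")

# Without a target: join every donor column; overlapping names get a _d suffix.
fused_all = linked.merge(
    donors.rename(columns={"id": "id_d"}),
    on="id_d",
    how="left",
    suffixes=("", "_d"),
)

print(fused)
print(fused_all.columns.tolist())
```

Because d1 was matched twice, its income value appears on both r1 and r2, which is the usual consequence of allowing donor reuse.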