Hot Deck Imputation
-
class datafusionsm.implicit_model.HotDeck(match_method='nearest', score_method='cosine', minimize=True, importance=None)
Fuse two data sources using statistical matching (hot-deck imputation), based on record similarity/distances and information implicit to each.
- Parameters
match_method (str (default='nearest')) – Algorithm used to match records from each data source: ‘nearest’, ‘neighbors’, ‘hungarian’, or ‘jonker_volgenant’.
score_method (str, callable, optional (default='cosine')) – Similarity/distance measure used to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics.
minimize (boolean (default=True)) – Whether to minimize the distance between records (True) or maximize the score/similarity (False).
importance (str (default=None)) –
-
critical
Critical cells to match within, where records must match perfectly
- Type
array-like[str]
-
matches
Matched record pairs, given as record indices or id column values
- Type
array-like[tuple[str, str]]
-
usage
How often each donor was used as a matching donor
- Type
Counter
-
imp_wgts
Weights defining the impact of each feature when comparing records.
- Type
array-like[float]
Examples
>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import HotDeck
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> hd = HotDeck(
...     match_method="neighbors", score_method="manhattan"
... ).fit(panel, survey, critical="age,gender")
>>> fused = hd.transform(panel, survey, target="income")
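To make the difference between unconstrained and constrained matching concrete, here is a minimal sketch built only on NumPy and SciPy. It is not the HotDeck implementation: the toy arrays are invented, "cityblock" (SciPy's name for Manhattan distance) stands in for score_method, and scipy.optimize.linear_sum_assignment stands in for a one-to-one match such as ‘hungarian’. The Counter at the end mirrors the idea behind the usage attribute.

```python
from collections import Counter

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Toy linking features, one row per record (values invented for illustration).
recipients = np.array([[0.0, 0.0], [10.0, 10.0], [1.0, 1.0]])
donors = np.array([[0.0, 1.0], [9.0, 10.0], [4.0, 4.0]])

# Pairwise distances: rows index recipients, columns index donors.
dist = cdist(recipients, donors, metric="cityblock")

# Unconstrained nearest match: each recipient takes its closest donor,
# so the same donor may be used more than once.
nearest = dist.argmin(axis=1)

# Constrained match: a one-to-one assignment that minimizes total distance.
rec_idx, donor_idx = linear_sum_assignment(dist)

# Count how often each donor index was used under the unconstrained match.
usage = Counter(nearest.tolist())

print(nearest.tolist())    # donor index chosen per recipient (unconstrained)
print(donor_idx.tolist())  # donor index chosen per recipient (one-to-one)
print(usage)
```

In this toy data the unconstrained match reuses donor 0 for two recipients, while the one-to-one assignment forces the third recipient onto a more distant donor. That trade-off between match quality and donor reuse is the usual reason to choose between the two families of methods.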
-
fit(donors, recipients, linking=None, critical=None, imp_wgts=None, target=None, match_args=None, score_args=None, donor_id_col=0, recipient_id_col=0)
Fuse two data sources by matching records.
- Parameters
donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.
critical (array-like, optional (default=None)) – Features that must match exactly when fusing
imp_wgts (array-like, dict, optional(default=None)) – Importance weights for linking variables
target (string, array-like, optional (default=None)) – What information will be donated. When given here, serves as the reference column if calculating importance weights. If None, importance weights will be by individual columns.
match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.
score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, see sklearn.metrics.
donor_id_col (int (default=0)) – Index of the column serving as the donor record index
recipient_id_col (int (default=0)) – Index of the column serving as the recipient record index
- Returns
self
- Return type
object
Notes
The data contained in donors and recipients is assumed to have at least a few overlapping features with common values. They should also contain an id column appropriately titled.
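The roles of critical, linking, and imp_wgts can be illustrated with a small hand-rolled sketch in pandas. This is an assumption-laden illustration, not the library's code: the column names, the weight values, and the weighted Manhattan score are all hypothetical, and the sketch assumes every recipient has at least one donor in its critical cell.

```python
import numpy as np
import pandas as pd

# Invented toy data; both frames share an "id" column and overlapping features.
donors = pd.DataFrame({
    "id": ["d1", "d2", "d3"],
    "gender": ["f", "f", "m"],
    "age": [30, 50, 40],
    "hours_tv": [10, 2, 5],
    "income": [40, 70, 55],
})
recipients = pd.DataFrame({
    "id": ["r1", "r2"],
    "gender": ["f", "m"],
    "age": [48, 41],
    "hours_tv": [9, 5],
})

critical = ["gender"]            # must match exactly
linking = ["age", "hours_tv"]    # compared by distance
weights = np.array([1.0, 0.5])   # hypothetical importance weights, one per linking column

matches = []
for i in range(len(recipients)):
    rec = recipients.iloc[i]

    # Restrict the donor pool to records in the same critical cell.
    mask = np.ones(len(donors), dtype=bool)
    for col in critical:
        mask &= (donors[col] == rec[col]).to_numpy()
    pool = donors[mask]

    # Weighted Manhattan distance on the linking columns.
    rec_vals = recipients[linking].iloc[i].to_numpy(dtype=float)
    diffs = np.abs(pool[linking].to_numpy(dtype=float) - rec_vals)
    scores = (diffs * weights).sum(axis=1)

    # Keep the closest donor as a (recipient id, donor id) pair.
    best = pool.iloc[int(scores.argmin())]
    matches.append((rec["id"], best["id"]))

print(matches)
```

Filtering on the critical columns before scoring guarantees exact agreement on those features, while the weights only change how strongly each linking column influences the choice within a cell.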
-
transform(donors, recipients, target=None)
Using fused ids, impute information from donor data to the recipient data.
- Parameters
donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
target (string, array-like, optional (default=None)) – What information will be shared from donor data. If None, data will be joined on fused indices and all overlapping fields from donors will have _d suffix.
- Returns
ret – New DataFrame containing donated information
- Return type
pandas.DataFrame
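Conceptually, the imputation step is a join on the matched ids. The sketch below reproduces that idea with plain pandas merges; it is not the library's transform. The toy frames, the donor_id helper column, and the (recipient id, donor id) ordering of the pairs are assumptions made for illustration. The second merge mimics the documented target=None behavior, where overlapping donor fields receive a _d suffix.

```python
import pandas as pd

# Invented toy data.
donors = pd.DataFrame({"id": ["d1", "d2"], "age": [30, 50], "income": [40, 70]})
recipients = pd.DataFrame({"id": ["r1", "r2", "r3"], "age": [48, 31, 52]})

# Matched pairs, assumed here to be (recipient id, donor id).
matches = [("r1", "d2"), ("r2", "d1"), ("r3", "d2")]
links = pd.DataFrame(matches, columns=["id", "donor_id"])

# Rename the donor key so it does not collide with the recipient "id".
donors_keyed = donors.rename(columns={"id": "donor_id"})

# Case 1: a single target column is donated.
fused = (
    recipients
    .merge(links, on="id", how="left")
    .merge(donors_keyed[["donor_id", "income"]], on="donor_id", how="left")
)

# Case 2: no target given, so every donor column is joined and
# overlapping columns from the donor side get a "_d" suffix.
fused_all = (
    recipients
    .merge(links, on="id", how="left")
    .merge(donors_keyed, on="donor_id", how="left", suffixes=("", "_d"))
)

print(fused)
print(fused_all)
```

Left joins keep every recipient row in its original order, so the result has one row per recipient even when several recipients share a donor.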