Predictive Mean Matching

class datafusionsm.implicit_model.PMM(targets, match_method='nearest', score_method='euclidean', model_method=None)

Fuse two data sources using Predictive Mean Matching. A model for the target is trained on the donor data and then applied to both the donor and recipient data sets. Statistical matching (hot-deck imputation) is then performed on the predicted target values, using record similarity or distance. Live (actually observed) values from the donor data are then imputed to the recipients.
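
The procedure described above can be illustrated with a minimal, standard-library sketch. It is not the datafusionsm implementation: the helper names `fit_line` and `pmm_impute`, the single-predictor least-squares model, and the toy data are all assumptions made for illustration.

```python
# Conceptual sketch of predictive mean matching; not datafusionsm's own code.

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(donor_x, donor_y, recipient_x):
    """Give each recipient the observed target of the donor with the closest prediction."""
    # 1. Train the target model on the donor data only.
    intercept, slope = fit_line(donor_x, donor_y)
    # 2. Apply the model to both donors and recipients.
    donor_pred = [intercept + slope * x for x in donor_x]
    recip_pred = [intercept + slope * x for x in recipient_x]
    imputed = []
    for pred in recip_pred:
        # 3. Match on the predicted target value (nearest donor).
        best = min(range(len(donor_pred)), key=lambda i: abs(donor_pred[i] - pred))
        # 4. Donate the observed value, not the model prediction.
        imputed.append(donor_y[best])
    return imputed


# Hypothetical toy data: age is the shared feature, income is the target.
donor_age = [25, 35, 45, 55]
donor_income = [31, 42, 48, 61]
recipient_age = [27, 52]
imputed_income = pmm_impute(donor_age, donor_income, recipient_age)
```

Because the donated values are taken from real donor records, every imputed value is one that was actually observed, which is the main reason to match on predictions rather than impute the predictions themselves.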

Parameters

match_method (str (default='nearest')) – Algorithm used to match records from each data source: ‘nearest’, ‘neighbors’, ‘hungarian’ or ‘jonker_volgenant’.
score_method (str, callable, optional (default='euclidean')) – Similarity/distance measure used to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics.
model_method (str, optional (default None)) – Type/class of model used for predicting the target variable.
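
To show why the choice of matching algorithm matters, the sketch below contrasts unconstrained nearest-donor matching with a one-to-one assignment that minimises total distance. The Hungarian and Jonker-Volgenant algorithms both solve this linear assignment problem; whether the library's options behave exactly like this is an assumption, and the brute-force search and toy values are purely illustrative.

```python
from itertools import permutations

# Hypothetical predicted target values for three recipients and three donors.
recipient_pred = [10.0, 11.0, 30.0]
donor_pred = [10.4, 20.0, 31.0]

# cost[i][j]: distance between recipient i and donor j. In one dimension the
# Euclidean distance reduces to an absolute difference.
cost = [[abs(r - d) for d in donor_pred] for r in recipient_pred]

# Unconstrained nearest matching: each recipient takes its closest donor,
# so the same donor may be used more than once.
nearest = [min(range(len(donor_pred)), key=lambda j: row[j]) for row in cost]

# One-to-one matching: pick the assignment with the smallest total distance.
# Brute force is fine for three records; dedicated assignment algorithms
# solve the same problem efficiently for realistic sizes.
optimal = list(min(
    permutations(range(len(donor_pred))),
    key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
))
```

Here `nearest` reuses the first donor for two recipients, while `optimal` spreads the donors out at the price of a larger distance for the second recipient.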

critical

Critical cells: features on which donor and recipient records must match exactly.

Type: array-like[str]
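
As a rough illustration of matching within critical cells, the sketch below only considers donors that share every critical value with the recipient. The function name `match_within_cells`, the record layout, the (recipient id, donor id) pair order, and the decision to skip recipients with no eligible donor are assumptions, not documented library behaviour.

```python
from collections import defaultdict


def match_within_cells(donors, recipients, critical, score_key):
    """Nearest-donor matching restricted to records sharing all critical values."""
    # Bucket donors by their critical cell, e.g. ("18-34", "F").
    cells = defaultdict(list)
    for donor in donors:
        cells[tuple(donor[c] for c in critical)].append(donor)

    pairs = []
    for rec in recipients:
        candidates = cells.get(tuple(rec[c] for c in critical), [])
        if not candidates:
            continue  # no donor in this cell; this sketch leaves the recipient unmatched
        best = min(candidates, key=lambda d: abs(d[score_key] - rec[score_key]))
        pairs.append((rec["id"], best["id"]))
    return pairs


# Hypothetical records; "pred" stands in for the predicted target value.
donors = [
    {"id": "d1", "age": "18-34", "gender": "F", "pred": 40.0},
    {"id": "d2", "age": "18-34", "gender": "M", "pred": 52.0},
    {"id": "d3", "age": "35-54", "gender": "F", "pred": 51.0},
]
recipients = [
    {"id": "r1", "age": "18-34", "gender": "F", "pred": 50.0},
    {"id": "r2", "age": "35-54", "gender": "M", "pred": 45.0},
]
pairs = match_within_cells(donors, recipients, ["age", "gender"], "pred")
```

Recipient `r1` is paired with `d1` even though `d3` has a closer predicted value, because `d3` falls in a different cell.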

results

For each target:

matched id pairs – record indices or id column values
usage – count of donor usage
scores – distances for matched records

Type: dict[string, array-like[tuple[str, str]]]
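
The relationship between these three pieces can be sketched with plain Python data. The values, the (recipient id, donor id) pair order, and the variable names below are hypothetical; the exact layout of `results` should be checked against the library.

```python
from collections import Counter

# Hypothetical matched pairs as (recipient id, donor id), with one distance each.
matched = [("r1", "d2"), ("r2", "d2"), ("r3", "d5")]
scores = [0.12, 0.40, 0.05]

# usage: how many recipients each donor served.
usage = Counter(donor_id for _, donor_id in matched)

# Two simple diagnostics: the most reused donor and the average match distance.
most_used_donor, times_used = usage.most_common(1)[0]
mean_distance = sum(scores) / len(scores)
```

Heavy reuse of a few donors or a large average distance can be a sign that the donor pool covers the recipients poorly.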

Examples

>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import PMM
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> pmm = PMM(targets=["income"], match_method="jonker_volgenant",
...           model_method="forest").fit(panel, survey, critical=["age", "gender"])
>>> fused = pmm.transform(panel, survey)

fit(donors, recipients, linking=None, critical=None, match_args=None, score_args=None, model_args=None, donor_id_col=0, recipient_id_col=0)

Fuse two data sources by matching records

Parameters

donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.
critical (array-like, optional (default=None)) – Features that must match exactly when fusing
match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.
score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, see sklearn.metrics.
model_args (dict, optional (default=None)) – Additional arguments for the target model.
donor_id_col (int, optional (default=0)) – Index of the column serving as the donor record index
recipient_id_col (int, optional (default=0)) – Index of the column serving as the recipient record index

Returns

self

Return type

object

Notes

The data contained in donors and recipients is assumed to have at least a few overlapping features with common values. Each data set should also contain an appropriately titled id column.
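
Before fitting, it can help to check both assumptions: which columns the two sources share, and whether a shared column is coded with the same values. The helpers `overlapping_columns` and `shared_levels` and the column names below are hypothetical and operate on plain lists; with DataFrames the column names would come from `.columns`.

```python
def overlapping_columns(donor_cols, recipient_cols):
    """Columns present in both sources, kept in donor order."""
    shared = set(recipient_cols)
    return [col for col in donor_cols if col in shared]


def shared_levels(donor_values, recipient_values):
    """Values of one feature that occur in both sources."""
    return sorted(set(donor_values) & set(recipient_values))


# Hypothetical column names for the two sources.
donor_cols = ["panel_id", "age", "gender", "region", "income"]
recipient_cols = ["survey_id", "gender", "age", "region", "hours_online"]
linking = overlapping_columns(donor_cols, recipient_cols)

# A shared column is only useful for linking if its coding agrees.
consistent = shared_levels(["F", "M", "F"], ["M", "F"])
mismatched = shared_levels(["F", "M", "F"], ["female", "male"])
```

An empty result from `shared_levels`, as in `mismatched`, signals that a feature needs recoding before it can link the two sources.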

transform(donors, recipients)

Using fused ids, impute information from donor data to the recipient data.

Parameters

donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information

Returns

ret – New DataFrame containing donated information

Return type

pandas.DataFrame
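
Conceptually, the imputation step is a lookup from each recipient to its matched donor. The sketch below uses plain dictionaries rather than DataFrames; the function name `impute`, the record layout, the (recipient id, donor id) pair order, and the use of None for unmatched recipients are assumptions for illustration only.

```python
def impute(donors, recipients, pairs, targets):
    """Copy target fields from each matched donor onto a copy of the recipient."""
    donors_by_id = {d["id"]: d for d in donors}
    match = dict(pairs)  # recipient id -> donor id
    fused = []
    for rec in recipients:
        row = dict(rec)  # build a new record; the inputs are left untouched
        donor = donors_by_id.get(match.get(rec["id"]))
        for target in targets:
            # Unmatched recipients get None in this sketch.
            row[target] = donor[target] if donor is not None else None
        fused.append(row)
    return fused


# Hypothetical records and matched id pairs.
donors = [
    {"id": "d1", "age": 30, "income": 40},
    {"id": "d2", "age": 50, "income": 65},
]
recipients = [
    {"id": "r1", "age": 31},
    {"id": "r2", "age": 49},
    {"id": "r3", "age": 70},
]
pairs = [("r1", "d1"), ("r2", "d2")]
fused = impute(donors, recipients, pairs, ["income"])
```

The result has one row per recipient, with the donated target added as a new field.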
