Predictive Mean Matching

class datafusionsm.implicit_model.PMM(targets, match_method='nearest', score_method='euclidean', model_method=None)

Fuse two data sources using Predictive Mean Matching. A model for the target is trained on the donor data and then applied to both the donor and recipient data sets. Statistical matching (hot-deck imputation) is then performed based on the similarity/distance between the records' predicted target values. Live (actually observed) values from the donor data are then imputed to the recipient records.
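The steps above can be sketched with NumPy alone. This is a minimal illustration of the general technique, assuming a plain least-squares model and nearest-neighbour matching on the predicted values; the function name pmm_impute and the toy arrays are hypothetical and are not part of datafusionsm.

```python
import numpy as np


def pmm_impute(X_donor, y_donor, X_recipient):
    """Illustrative predictive mean matching (hypothetical helper, not library code)."""
    # Add an intercept column and fit ordinary least squares on the donors only.
    A_donor = np.column_stack([np.ones(len(X_donor)), X_donor])
    A_recipient = np.column_stack([np.ones(len(X_recipient)), X_recipient])
    coef, *_ = np.linalg.lstsq(A_donor, y_donor, rcond=None)

    # Apply the same model to both data sets.
    pred_donor = A_donor @ coef
    pred_recipient = A_recipient @ coef

    # Match each recipient to the donor with the closest predicted value.
    dist = np.abs(pred_recipient[:, None] - pred_donor[None, :])
    donor_idx = dist.argmin(axis=1)

    # Impute the donor's observed value, not the model prediction.
    return donor_idx, y_donor[donor_idx]


X_donor = np.array([[1.0], [2.0], [3.0], [4.0]])
y_donor = np.array([10.0, 21.0, 29.0, 41.0])
X_recipient = np.array([[1.2], [3.9]])

donor_idx, imputed = pmm_impute(X_donor, y_donor, X_recipient)
```

Note that the imputed values are always values that were actually observed among the donors, which is what distinguishes this approach from imputing the model prediction directly.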

Parameters
  • match_method (str (default='nearest')) – Algorithm used to match records from each data source. One of ‘nearest’, ‘neighbors’, ‘hungarian’, ‘jonker_volgenant’.

  • score_method (str, callable, optional (default 'euclidean')) – similarity/distance measure to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics

  • model_method (str, optional (default None)) – Type/class of model used for predicting the target variable.
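To illustrate how a scoring metric and a matching strategy interact, the sketch below computes a distance matrix with scipy.spatial.distance.cdist and compares a simple nearest-donor match with an optimal one-to-one assignment from scipy.optimize.linear_sum_assignment. It is an assumption-laden illustration of the general ideas behind the 'nearest' and assignment-style options, not the library's internal code; the arrays are made-up predicted values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical predicted target values, one row per record.
recipient_pred = np.array([[1.0], [1.1]])
donor_pred = np.array([[1.0], [2.0], [5.0]])

# Pairwise distances: rows are recipients, columns are donors.
dist = cdist(recipient_pred, donor_pred, metric="euclidean")

# Nearest-style matching: every recipient takes its closest donor,
# so the same donor may be used more than once.
nearest = dist.argmin(axis=1)

# Assignment-style matching: each donor is used at most once and the
# total distance over all matched pairs is minimised.
rows, cols = linear_sum_assignment(dist)
```

Here both recipients are closest to the first donor, so the nearest match reuses it, while the assignment gives the second recipient the next-best donor.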

critical

Critical cells: features on which donor and recipient records must match exactly. Matching is performed only within each cell.

Type

array-like[str]

results
For each target:

  • matched id pairs – record indices or id column values

  • usage – count of donor usage

  • scores – distances for matched records

Type

dict[string, array-like[tuple[str, str]]]

Examples

>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import PMM
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> pmm = PMM(targets=["income"], match_method="jonker_volgenant",
...           model_method="forest").fit(panel, survey, critical=["age", "gender"])
>>> fused = pmm.transform(panel, survey)

fit(donors, recipients, linking=None, critical=None, match_args=None, score_args=None, model_args=None, donor_id_col=0, recipient_id_col=0)

Fuse two data sources by matching records

Parameters
  • donors (pandas.DataFrame) – Records containing information to be donated

  • recipients (pandas.DataFrame) – Records that will receive information

  • linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.

  • critical (array-like, optional (default=None)) – Features that must match exactly when fusing

  • match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.

  • score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, see sklearn.metrics.

  • model_args (dict, optional (default=None)) – Additional arguments for the target model.

  • donor_id_col (int, optional (default=0)) – Index of column serving as donor record index

  • recipient_id_col (int, optional (default=0)) – Index of column serving as recipient record index

Returns

self

Return type

object

Notes

The data contained in donors and recipients is assumed to have at least a few overlapping features with common values. They should also contain an id column appropriately titled.
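The effect of the critical argument can be pictured as restricting candidate pairs to records that agree exactly on the critical columns before the closest donor is chosen. The pandas sketch below is only an illustration of that idea under assumed column names (donor_id, recipient_id, gender, pred); it is not how the library implements it.

```python
import pandas as pd

# Hypothetical donors and recipients with an already-predicted target ("pred").
donors = pd.DataFrame({
    "donor_id": ["d1", "d2", "d3"],
    "gender": ["F", "F", "M"],
    "pred": [32.0, 50.0, 35.0],
})
recipients = pd.DataFrame({
    "recipient_id": ["r1", "r2"],
    "gender": ["M", "F"],
    "pred": [32.0, 45.0],
})
critical = ["gender"]

# Joining on the critical columns keeps only pairs inside the same cell.
pairs = recipients.merge(donors, on=critical, suffixes=("_recipient", "_donor"))

# Score the remaining candidate pairs by distance on the predicted value.
pairs["score"] = (pairs["pred_recipient"] - pairs["pred_donor"]).abs()

# Keep the lowest-scoring donor for each recipient.
best = pairs.loc[pairs.groupby("recipient_id")["score"].idxmin()]
matches = dict(zip(best["recipient_id"], best["donor_id"]))
```

In this toy data r1 is matched to d3 even though d1 has an identical predicted value, because d1 falls in a different critical cell.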

transform(donors, recipients)

Using fused ids, impute information from donor data to the recipient data.

Parameters
  • donors (pandas.DataFrame) – Records containing information to be donated

  • recipients (pandas.DataFrame) – Records that will receive information

Returns

ret – New DataFrame containing donated information

Return type

pandas.DataFrame
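Conceptually, the imputation step amounts to joining donor columns onto the recipients through the matched id pairs. The following pandas sketch shows that idea only; the (recipient id, donor id) pair order, the column names, and the toy values are assumptions made for illustration and may differ from what the library stores in results.

```python
import pandas as pd

donors = pd.DataFrame({
    "id": ["d1", "d2"],
    "age": [34, 51],
    "income": [42000, 58000],
})
recipients = pd.DataFrame({
    "id": ["r1", "r2", "r3"],
    "age": [33, 52, 50],
})

# Hypothetical matched pairs in (recipient id, donor id) order.
matched = [("r1", "d1"), ("r2", "d2"), ("r3", "d2")]
pairs = pd.DataFrame(matched, columns=["id", "donor_id"])

# Select the donor columns to donate and key them by donor id.
donated = donors[["id", "income"]].rename(columns={"id": "donor_id"})

# Attach each recipient's matched donor, then pull in the donated values.
fused = (
    recipients
    .merge(pairs, on="id", how="left")
    .merge(donated, on="donor_id", how="left")
)
```

The result has one row per recipient, with the observed donor income attached; donor d2 is used twice, which is the kind of reuse the usage counts describe.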