Predictive Mean Matching
-
class datafusionsm.implicit_model.PMM(targets, match_method='nearest', score_method='euclidean', model_method=None)
Fuse two data sources using Predictive Mean Matching. A model for the target is trained on the donor data and then applied to both the donor and recipient data sets. Statistical matching (hot-deck imputation) is then performed, based on record similarity/distances computed on the predicted target values. Live (actually observed) values from the donor data are then imputed for the recipient.
- Parameters
match_method (str, optional (default='nearest')) – Algorithm used to match records from each data source. One of ‘nearest’, ‘neighbors’, ‘hungarian’, ‘jonker_volgenant’.
score_method (str, callable, optional (default='euclidean')) – Similarity/distance measure used to compare records. Can be any metric available in scipy.spatial.distance or sklearn.metrics.
model_method (str, optional (default None)) – Type/class of model used for predicting the target variable.
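As a rough illustration of how the choice of matching algorithm and distance metric interact, the sketch below computes pairwise scores with `scipy.spatial.distance.cdist` and compares an unconstrained nearest match against an optimal one-to-one assignment. The toy arrays are hypothetical, and `scipy.optimize.linear_sum_assignment` is used here only as a stand-in for the linear assignment problem that the Hungarian and Jonker-Volgenant algorithms solve. It is an assumption, not a statement about how PMM implements these options internally.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical predicted target values for 3 recipients and 4 donors.
recipients = np.array([[1.0], [2.0], [2.1]])
donors = np.array([[2.0], [1.1], [5.0], [2.6]])

# Pairwise distances; the metric name plays the role of score_method.
scores = cdist(recipients, donors, metric="euclidean")

# Unconstrained nearest match: the same donor may be used more than once.
nearest = scores.argmin(axis=1)

# Optimal one-to-one assignment: each donor is used at most once,
# and the total distance over all matched pairs is minimised.
rows, cols = linear_sum_assignment(scores)

print("nearest donor per recipient:", nearest.tolist())
print("one-to-one donor per recipient:", cols.tolist())
```

In this toy case the nearest match reuses one donor for two recipients, while the assignment spreads the recipients over distinct donors at a slightly higher total distance.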
-
critical
Critical cells to match within; records must match perfectly on these.
- Type
array-like[str]
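The idea of matching within critical cells can be sketched with plain pandas: candidate pairs are formed only between records that agree exactly on the critical columns, and the closest donor is then chosen inside each cell. The column names (`id`, `gender`, `pred`) and the data are hypothetical, and this is a minimal sketch of the concept rather than the library's own implementation.

```python
import pandas as pd

# Hypothetical donor and recipient records with a predicted target value.
donors = pd.DataFrame({
    "id": ["d1", "d2", "d3", "d4"],
    "gender": ["F", "F", "M", "M"],
    "pred": [10.0, 20.0, 11.0, 30.0],
})
recipients = pd.DataFrame({
    "id": ["r1", "r2"],
    "gender": ["F", "M"],
    "pred": [12.0, 12.0],
})

critical = ["gender"]

# Joining on the critical columns keeps only pairs in the same cell.
cand = recipients.merge(donors, on=critical, suffixes=("_rec", "_don"))
cand["score"] = (cand["pred_rec"] - cand["pred_don"]).abs()

# Within each recipient's candidates, keep the donor with the smallest score.
best = cand.loc[cand.groupby("id_rec")["score"].idxmin()]
pairs = list(zip(best["id_rec"], best["id_don"]))
print(pairs)
```

Note that donor `d3` has the predicted value closest to recipient `r1`, but it is never considered for `r1` because the two records differ on the critical column.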
-
results
For each target:
- matched id pairs: record indices or id column values
- usage: count of donor usage
- scores: distances for matched records
- Type
dict[string, array-like[tuple[str, str]]]
Examples
>>> from datafusionsm.datasets import load_tv_panel, load_online_survey
>>> from datafusionsm.implicit_model import PMM
>>> panel = load_tv_panel()
>>> survey = load_online_survey()
>>> pmm = PMM(match_method="jonker_volgenant",
...           model_method="forest").fit(panel, survey, critical="age,gender")
>>> fused = pmm.transform(panel, survey, target="income")
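The four steps in the class description (train on donors, predict for both sources, match on predictions, impute observed values) can also be sketched without the library. The example below is a minimal illustration under stated assumptions: the data are synthetic, `LinearRegression` stands in for whatever `model_method` selects, and matching is a simple unconstrained nearest neighbour on the absolute difference of predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: two linking features shared by both sources.
X_donor = rng.normal(size=(50, 2))
y_donor = 3.0 * X_donor[:, 0] - 2.0 * X_donor[:, 1] + rng.normal(scale=0.1, size=50)
X_recipient = rng.normal(size=(20, 2))

# 1. Train a model for the target on the donor data.
model = LinearRegression().fit(X_donor, y_donor)

# 2. Apply the model to both the donor and the recipient records.
pred_donor = model.predict(X_donor)
pred_recipient = model.predict(X_recipient)

# 3. Match each recipient to the donor with the closest predicted value.
dist = np.abs(pred_recipient[:, None] - pred_donor[None, :])
donor_idx = dist.argmin(axis=1)

# 4. Impute the donor's observed (not predicted) target value.
imputed = y_donor[donor_idx]
print(imputed[:5])
```

Because the imputed values are always taken from observed donor records, every imputed value is one that actually occurs in the donor data.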
-
fit(donors, recipients, linking=None, critical=None, match_args=None, score_args=None, model_args=None, donor_id_col=0, recipient_id_col=0)
Fuse two data sources by matching records.
- Parameters
donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
linking (array-like, optional (default=None)) – List of columns that will link the two data sources. If None, all overlapping columns will be used.
critical (array-like, optional (default=None)) – Features that must match exactly when fusing
match_args (dict, optional (default=None)) – Additional arguments for the matching algorithm. See the modules in fusion.implicit.matching for the list of possible matching parameters.
score_args (dict, optional (default=None)) – Additional arguments for the scoring method. For a list of scoring functions that can be used, look at sklearn.metrics.
model_args (dict, optional (default=None)) – Additional arguments for the target model.
donor_id_col (int, optional (default=0)) – Index of column serving as donor record index
recipient_id_col (int, optional (default=0)) – Index of column serving as recipient record index
- Returns
self
- Return type
object
Notes
The data contained in donors and recipients is assumed to have at least a few overlapping features with common values. They should also contain an id column appropriately titled.
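Before calling fit, it can help to check which columns the two sources actually share. The sketch below computes the overlap that would serve as linking columns when `linking` is None. The column names are hypothetical, and excluding the id column from the overlap is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical schemas for the two data sources.
donors = pd.DataFrame(columns=["id", "age", "gender", "income"])
recipients = pd.DataFrame(columns=["id", "age", "gender", "tv_hours"])

id_col = "id"  # hypothetical id column name

# Overlapping columns in donor column order, leaving out the id column.
linking = [c for c in donors.columns if c in recipients.columns and c != id_col]
print(linking)
```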
-
transform(donors, recipients)
Using fused ids, impute information from donor data to the recipient data.
- Parameters
donors (pandas.DataFrame) – Records containing information to be donated
recipients (pandas.DataFrame) – Records that will receive information
- Returns
ret – New DataFrame containing donated information
- Return type
pandas.DataFrame
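Conceptually, imputing from fused ids amounts to a pair of joins: recipients are joined to their matched donor ids, and the donated columns are then pulled in from the donor records. The sketch below shows this with pandas. The data, the column names, and the (recipient id, donor id) ordering of the pairs are all hypothetical and not taken from the library.

```python
import pandas as pd

donors = pd.DataFrame({"id": ["d1", "d2"], "age": [30, 40], "income": [50, 70]})
recipients = pd.DataFrame({"id": ["r1", "r2", "r3"], "age": [31, 39, 42]})

# Hypothetical matched (recipient id, donor id) pairs.
matches = pd.DataFrame(
    [("r1", "d1"), ("r2", "d2"), ("r3", "d2")],
    columns=["id", "donor_id"],
)

# Keep only the donor id and the target column to be donated.
donated = donors[["id", "income"]].rename(columns={"id": "donor_id"})

# Recipient -> matched donor id -> donated value.
fused = recipients.merge(matches, on="id").merge(donated, on="donor_id")
print(fused)
```

Here donor `d2` is used twice, so its observed income appears on two recipient rows.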