data-fusion-sm: data fusion via statistical matching

data-fusion-sm is a Python package that provides classes and methods to join two data sources that contain disparate information; the goal is to share information that is specific to either data source. This is often referred to as data fusion or statistical matching.

Data fusion should be understood as a much broader class of algorithms that can include techniques designed for various types of data including sensors, medical equipment, genomes, surveys, and many more. Algorithms for general missing data imputation can also be used to perform data fusion, irrespective of data type. Statistical matching is a specific set of methods designed for data similar to surveys and panels, where records are typically people with a set of measured attributes. data-fusion-sm is designed to work with such data and includes common statistical matching methods to perform data fusion.

Minimal Examples

Fusion of two surveys via Hot-Deck imputation:

import pandas as pd
from datafusionsm.implicit_model import HotDeck

survey1 = pd.read_csv("survey1")
survey2 = pd.read_csv("survey2")

impF = HotDeck()
fused_survey = impF.fit_transform(survey1, survey2)

Fusion of two surveys via Predictive Mean Matching:

import pandas as pd
from datafusionsm.implicit_model import PMM

survey1 = pd.read_csv("survey1")
survey2 = pd.read_csv("survey2")

impF = PMM("forest")
fused_survey = impF.fit_transform(survey1, survey2)