Model for the outcome for the mass imputation estimator based on predictive mean matching (PMM). The implementation is currently based on the RANN::nn2 function and thus uses Euclidean distance for matching units from \(S_A\) (non-probability) to \(S_B\) (probability) based on predicted values from a model fitted with either method_glm or method_npar. Estimation of the mean is done using the \(S_B\) sample.
This implementation extends the Yang et al. (2021) approach as described in Chlebicki et al. (2025), namely:
- pmm_weights
if k > 1, weighted aggregation of the mean for a given unit is used. We use the distance matrix returned by the RANN::nn2 function (pmm_weights from the control_out() function)
- nn_exact_se
if the non-probability sample is small, we recommend using a mini-bootstrap approach to estimate the variance from the non-probability sample (nn_exact_se from the control_inf() function)
- pmm_k_choice
the main nonprob function allows for dynamic selection of k neighbours based on the variance minimization procedure (pmm_k_choice from the control_out() function)
Usage
method_pmm(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)
Arguments
- y_nons
target variable from non-probability sample
- X_nons
a model.matrix with auxiliary variables from the non-probability sample
- X_rand
a model.matrix with auxiliary variables from the probability sample
- svydesign
a svydesign object
- weights
case / frequency weights from non-probability sample
- family_outcome
family for the glm model
- start_outcome
start parameters
- vars_selection
whether variable selection should be conducted
- pop_totals
a placeholder (not used in method_pmm)
- pop_size
population size from the nonprob function
- control_outcome
controls passed by the control_out function
- control_inference
controls passed by the control_inf function
- verbose
parameter passed from the main nonprob function
- se
whether standard errors should be calculated
Value
a nonprob_method class, which is a list with the following entries
- model_fitted
fitted model, either a glm.fit or a cv.ncvreg object
- y_nons_pred
predicted values for the non-probability sample
- y_rand_pred
predicted values for the probability sample or population totals
- coefficients
coefficients for the model (if available)
- svydesign
an updated survey.design2 object (new column y_hat_MI is added)
- y_mi_hat
estimated population mean for the target variable
- vars_selection
whether variable selection was performed
- var_prob
variance for the probability sample component (if available)
- var_nonprob
variance for the non-probability sample component
- model
model type (character "pmm")
- family
depends on the method selected for estimating E(Y|X)
Details
Matching
In the package we support two types of matching:
1. \(\hat{y} - \hat{y}\) matching (default; control_out(pmm_match_type = 1)).
2. \(\hat{y} - y\) matching (control_out(pmm_match_type = 2)).
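The following is a minimal base-R sketch of the matching idea, not the package's implementation. It assumes that type 1 compares predicted values for probability-sample units with predicted values for non-probability units, and that type 2 compares them with the observed outcomes. The toy data, the helper pmm_impute, and the equal design weights d_B are all hypothetical, and a plain absolute difference stands in for the RANN::nn2 search because the predictions are one-dimensional.

```r
set.seed(123)

# Hypothetical toy data: non-probability sample A (y observed) and
# probability sample B (only x and design weights observed).
n_A <- 200
n_B <- 100
x_A <- rnorm(n_A)
y_A <- 1 + 2 * x_A + rnorm(n_A)
x_B <- rnorm(n_B)
d_B <- rep(10, n_B)  # design weights of sample B (assumed equal here)

# Fit the outcome model on sample A and predict for both samples.
fit <- glm(y_A ~ x_A, family = gaussian())
y_hat_A <- fitted(fit)
y_hat_B <- predict(fit, newdata = data.frame(x_A = x_B))

# For each unit in B, find the k closest donors in A and average their
# observed y. `target` is the quantity the predictions for B are compared with.
pmm_impute <- function(y_hat_B, target, y_A, k = 1) {
  vapply(y_hat_B, function(p) {
    donors <- order(abs(target - p))[seq_len(k)]
    mean(y_A[donors])
  }, numeric(1))
}

# Type 1 (assumed): predicted vs. predicted; type 2 (assumed): predicted vs. observed.
y_imp_type1 <- pmm_impute(y_hat_B, target = y_hat_A, y_A = y_A, k = 5)
y_imp_type2 <- pmm_impute(y_hat_B, target = y_A, y_A = y_A, k = 5)

# Design-weighted mean of the imputed values over sample B.
mu_type1 <- weighted.mean(y_imp_type1, d_B)
mu_type2 <- weighted.mean(y_imp_type2, d_B)
c(type1 = mu_type1, type2 = mu_type2)
```

In both variants the imputed value is an average of observed outcomes from the non-probability sample, so it always stays within the range of y_A; only the rule for choosing donors differs.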
Analytical variance
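The variance of the mean is estimated using the following approach:
(a) non-probability part (\(S_A\) with size \(n_A\); denoted as var_nonprob in the result) is currently estimated using the non-parametric mini-bootstrap estimator proposed by Chlebicki et al. (2025, Algorithm 2). It has not been proved to be consistent, but it has good finite-population properties. This bootstrap can be applied using control_inference = control_inf(nn_exact_se = TRUE) and can be summarized as follows:
1. Sample \(n_A\) units from \(S_A\) with replacement to create \(S_A'\) (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).
2. Estimate the regression model \(\hat{\mathbb{E}}[Y|\mathbf{X}]'\) based on \(S_A'\) from step 1.
3. Compute the nearest-neighbour indices \(\hat{\nu}'(i, t)\) for \(t = 1, \dots, k\), \(i \in S_B\), using the estimated \(\hat{\mathbb{E}}[Y|\mathbf{X}]'\) and \(\{(y_j, \mathbf{x}_j) \mid j \in S_A'\}\).
4. Compute \(\frac{1}{k}\sum_{t=1}^{k} y_{\hat{\nu}'(i, t)}\) using \(Y\) values from \(S_A'\).
5. Repeat steps 1-4 \(M\) times (the value of \(M\) is hard-coded in our code).
6. Estimate \(\hat{V}_1\) from the values obtained in the simulations and save it as var_nonprob.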
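The mini-bootstrap described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the toy data, the helper pmm_mean, the linear outcome model, and the number of replicates M are hypothetical choices, no pseudo-weights are used, and the final step is taken to be the sample variance of the replicate means.

```r
set.seed(2025)

# Hypothetical toy data: non-probability sample A and probability sample B.
n_A <- 150
n_B <- 80
A <- data.frame(x = rnorm(n_A))
A$y <- 1 + 2 * A$x + rnorm(n_A)
B <- data.frame(x = rnorm(n_B), d = 10)  # d: design weights (assumed equal)

k <- 3   # number of donors per unit
M <- 50  # number of replicates; illustrative value only

# PMM estimate of the mean: fit on A, match predictions for B to predictions
# for A, average the donors' observed y, then take the design-weighted mean.
pmm_mean <- function(A, B, k) {
  fit <- lm(y ~ x, data = A)
  y_hat_A <- fitted(fit)
  y_hat_B <- predict(fit, newdata = B)
  y_imp <- vapply(y_hat_B, function(p) {
    donors <- order(abs(y_hat_A - p))[seq_len(k)]
    mean(A$y[donors])
  }, numeric(1))
  weighted.mean(y_imp, B$d)
}

# Steps 1-5: resample A with replacement, refit, rematch, re-estimate.
mu_boot <- replicate(M, {
  A_star <- A[sample.int(n_A, n_A, replace = TRUE), , drop = FALSE]
  pmm_mean(A_star, B, k)
})

# Step 6 (assumed form): variance of the replicate estimates.
var_nonprob <- var(mu_boot)
var_nonprob
```

Only the non-probability sample is resampled, so the spread of mu_boot reflects the variability coming from the donors and the refitted model, while the probability sample is held fixed.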
(b) probability part (\(S_B\) with size \(n_B\); denoted as var_prob in the result)
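This part uses functionalities of the {survey} package and the variance is estimated using the following equation:
\[
\hat{V}_2 = \frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}.
\]
Note that \(\hat{V}_2\) can, in principle, be estimated in various ways depending on the type of the design and whether the population size is known or not.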
Examples
library(nonprobsvy)

data(admin)
data(jvs)
# Probability sample (JVS) as a stratified survey design
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
# Mass imputation with PMM: admin is the non-probability sample
res_pmm <- method_pmm(y_nons = admin$single_shift,
                      X_nons = model.matrix(~ region + private + nace + size, admin),
                      X_rand = model.matrix(~ region + private + nace + size, jvs),
                      svydesign = jvs_svy)
res_pmm
#> Mass imputation model (PMM approach). Estimated mean: 0.6969 (se: 0.0146)