Model for the outcome for the mass imputation estimator using loess via stats::loess.
Estimation of the mean is done using the \(S_B\) probability sample.
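The idea can be illustrated with a minimal base R sketch. It is not the package's internal code: the simulated population, the selection mechanism and all variable names are hypothetical, and the probability sample is assumed to be a simple random sample. The outcome model is fitted on the non-probability sample, predictions are made for the probability sample, and the mean is estimated as the design-weighted mean of these predictions.

```r
set.seed(1)

# Hypothetical population with one auxiliary variable
N <- 5000
x <- rnorm(N, 2, 1)
y <- 0.3 + 0.5 * x^2 + rnorm(N)
pop <- data.frame(x, y)

# Non-probability sample: inclusion depends on x (selection bias)
in_nonprob <- runif(N) < plogis(-3 + 0.5 * x)
s_nonprob <- pop[in_nonprob, ]

# Probability sample (assumed SRS): y is treated as unobserved here
n_prob <- 500
s_prob <- pop[sample.int(N, n_prob), "x", drop = FALSE]
s_prob$w <- N / n_prob  # design weights

# Fit loess on the non-probability sample;
# surface = "direct" allows prediction outside the observed range of x
fit <- loess(y ~ x, data = s_nonprob,
             control = loess.control(surface = "direct"))

# Impute y for the probability sample and take the weighted mean
s_prob$y_hat <- as.numeric(predict(fit, newdata = s_prob))
y_mi_hat <- sum(s_prob$w * s_prob$y_hat) / sum(s_prob$w)

c(naive = mean(s_nonprob$y), mass_imputation = y_mi_hat, truth = mean(pop$y))
```

The naive mean of the non-probability sample is shown only for comparison with the mass imputation estimate.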
Usage
method_npar(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)
Arguments
- y_nons
target variable from non-probability sample
- X_nons
a model.matrix with auxiliary variables from non-probability sample
- X_rand
a model.matrix with auxiliary variables from probability sample
- svydesign
a svydesign object
- weights
case / frequency weights from non-probability sample (default NULL)
- family_outcome
family for the glm model
- start_outcome
a placeholder (not used in method_npar)
- vars_selection
whether variable selection should be conducted
- pop_totals
a placeholder (not used in method_npar)
- pop_size
population size from the nonprob function
- control_outcome
controls passed by the control_out function
- control_inference
controls passed by the control_inf function
- verbose
parameter passed from the main nonprob function
- se
whether standard errors should be calculated
Value
a nonprob_method class, which is a list with the following entries
- model_fitted
fitted model object returned by stats::loess
- y_nons_pred
predicted values for the non-probability sample
- y_rand_pred
predicted values for the probability sample or population totals
- coefficients
coefficients for the model (if available)
- svydesign
an updated surveydesign2 object (new column y_hat_MI is added)
- y_mi_hat
estimated population mean for the target variable
- vars_selection
whether variable selection was performed
- var_prob
variance for the probability sample component (if available)
- var_nonprob
variance for the non-probability sample component
- model
model type (character "npar")
Details
Analytical variance
The variance of the mean is estimated based on the following approach:
(a) non-probability part (\(S_A\) with size \(n_A\); denoted as var_nonprob in the result)
$$ \hat{V}_1 = \frac{1}{N^2} \sum_{i=1}^{n_A} \left\lbrace\hat{g}_B(\boldsymbol{x}_i)\right\rbrace^{2} \hat{e}_i^2, $$
where \(\hat{e}_i=y_i - \hat{m}(x_i)\) is the residual and \(\hat{g}_B(\boldsymbol{x}_i) = \left\lbrace \pi_B(\boldsymbol{x}_i) \right\rbrace^{-1}\) can be estimated in various ways. In the package we estimate \(\hat{g}_B(\boldsymbol{x}_i)\) using \(\pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x})\) as suggested by Chen et al. (2022, p. 6). In particular, we currently support this using stats::loess with the "gaussian" family.
(b) probability part (\(S_B\) with size \(n_B\); denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following equation:
$$ \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{\hat{m}(x_i)}{\pi_i} \frac{\hat{m}(x_j)}{\pi_j}. $$
Note that \(\hat{V}_2\) in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
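Both components can be traced numerically in base R. The sketch below is an illustration under a simplifying assumption, not the package's implementation: \(S_B\) is taken to be a simple random sample without replacement, so that \(\pi_i = n_B/N\), \(\pi_{ij} = n_B(n_B-1)/\{N(N-1)\}\) and \(\hat{g}_B(\boldsymbol{x}_i) = N/n_B\) is constant (the package instead estimates \(\hat{g}_B\) with stats::loess and delegates \(\hat{V}_2\) to {survey}). All data and variable names are hypothetical. Under this assumption the double sum in \(\hat{V}_2\) reduces to the familiar \((1 - n_B/N)\, s^2_{\hat{m}} / n_B\), which the code checks.

```r
set.seed(3)

# Hypothetical population
N <- 5000
x <- rnorm(N, 2, 1)
y <- 0.3 + 0.5 * x^2 + rnorm(N)

# S_A: non-probability sample, inclusion depends on x
in_a <- runif(N) < plogis(-3 + 0.5 * x)
s_a <- data.frame(x = x[in_a], y = y[in_a])

# S_B: simple random sample without replacement (assumption of this sketch)
n_b <- 60
s_b <- data.frame(x = x[sample.int(N, n_b)])

# Outcome model fitted on S_A; "direct" surface permits extrapolation
fit <- loess(y ~ x, data = s_a,
             control = loess.control(surface = "direct"))

# (a) V_1: squared residuals on S_A scaled by g_B(x_i)^2 and 1 / N^2
e_hat <- residuals(fit)
g_hat <- rep(N / n_b, nrow(s_a))  # 1 / pi_B is constant under SRS
v1 <- sum(g_hat^2 * e_hat^2) / N^2

# (b) V_2: double sum over S_B with first- and second-order
# inclusion probabilities of SRS without replacement
m_hat <- as.numeric(predict(fit, newdata = s_b))
pi_i <- rep(n_b / N, n_b)
pi_ij <- matrix(n_b * (n_b - 1) / (N * (N - 1)), n_b, n_b)
diag(pi_ij) <- pi_i               # pi_ii = pi_i
z <- m_hat / pi_i
v2 <- sum((pi_ij - outer(pi_i, pi_i)) / pi_ij * outer(z, z)) / N^2

# Closed form of the same quantity under SRS without replacement
v2_closed <- (1 - n_b / N) * var(m_hat) / n_b

c(v1 = v1, v2 = v2, v2_closed = v2_closed)
```

For other designs the inclusion probabilities are not constant, so the closed form no longer applies and \(\hat{g}_B\) has to be estimated.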
References
Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration. Journal of Survey Statistics and Methodology, 10(1), 1-24.
Examples
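library(survey)      # provides svydesign()
library(nonprobsvy)  # provides nonprob()

set.seed(123123123)
# population size and sample sizes
N <- 10000
n_a <- 500
n_b <- 1000
n_b1 <- 0.7*n_b
n_b2 <- 0.3*n_b
# auxiliary variables and two target variables (linear and non-linear)
x1 <- rnorm(N, 2, 1)
x2 <- rnorm(N, 2, 1)
y1 <- rnorm(N, 0.3 + 2*x1 + 2*x2, 1)
y2 <- rnorm(N, 0.3 + 0.5*x1^2 + 0.5*x2^2, 1)
strata <- x1 <= 2
pop <- data.frame(x1, x2, y1, y2, strata)
# sample_a: probability sample (simple random sample with weights N/n_a)
sample_a <- pop[sample(1:N, n_a),]
sample_a$w_a <- N/n_a
sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a)
# sample_b: non-probability sample, over-representing units with x1 <= 2
pop1 <- subset(pop, strata == TRUE)
pop2 <- subset(pop, strata == FALSE)
sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ],
                  pop2[sample(1:nrow(pop2), n_b2), ])
# mass imputation with the loess outcome model for both target variables
res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2,
                      data = sample_b,
                      svydesign = sample_a_svy,
                      method_outcome = "npar")
res_y_npar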
#> A nonprob object
#> - estimator type: mass imputation
#> - method: npar (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 7.5715
#> - variable y2: 4.6272
#> - selected estimators:
#> - variable y1: 8.3408 (se=0.1221, ci=(8.1015, 8.5802))
#> - variable y2: 5.3156 (se=0.1295, ci=(5.0618, 5.5695))