Model for the outcome for the mass imputation estimator using generalized linear models via the stats::glm function.
Estimation of the mean is done using the \(S_B\) probability sample or known population totals.
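The idea described above can be illustrated with a minimal base R sketch. It does not call method_glm and is not the package's internal code; the simulated population, the selection mechanism, and all object names below are assumptions made only for illustration. The outcome model is fitted on the non-probability sample, predictions are made for the probability sample, and the design-weighted mean of those predictions is taken as the estimate.

```r
set.seed(123)

# Hypothetical simulated population (illustration only, not package data)
N <- 10000
x_pop <- rnorm(N)
y_pop <- 1 + 2 * x_pop + rnorm(N)

# Non-probability sample S_A: inclusion depends on x, y is observed
in_A <- rbinom(N, 1, plogis(-2 + x_pop)) == 1
nonprob <- data.frame(x = x_pop[in_A], y = y_pop[in_A])

# Probability sample S_B: simple random sample, only x and weights are used
n_B <- 500
idx_B <- sample(N, n_B)
prob <- data.frame(x = x_pop[idx_B], w = rep(N / n_B, n_B))

# Step 1: fit the outcome model on S_A
fit <- glm(y ~ x, family = gaussian(), data = nonprob)

# Step 2: impute (predict) y for every unit in S_B
y_rand_pred <- predict(fit, newdata = prob, type = "response")

# Step 3: design-weighted mean of the imputed values
y_mi_hat <- sum(prob$w * y_rand_pred) / sum(prob$w)

# For comparison: the unadjusted mean of the non-probability sample
naive <- mean(nonprob$y)
```

In this sketch y_mi_hat plays the role of the y_mi_hat entry returned by method_glm, and y_rand_pred the role of y_rand_pred.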
Usage
method_glm(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)
Arguments
- y_nons
target variable from non-probability sample
- X_nons
a model.matrix with auxiliary variables from non-probability sample
- X_rand
a model.matrix with auxiliary variables from probability sample
- svydesign
a svydesign object
- weights
case / frequency weights from non-probability sample
- family_outcome
family for the glm model
- start_outcome
start parameters (default NULL)
- vars_selection
whether variable selection should be conducted
- pop_totals
population totals from the nonprob function
- pop_size
population size from the nonprob function
- control_outcome
controls passed by the control_out function
- control_inference
controls passed by the control_inf function (currently not used, for further development)
- verbose
parameter passed from the main nonprob function
- se
whether standard errors should be calculated
Value
an object of class nonprob_method, which is a list with the following entries:
- model_fitted
fitted model, either a glm.fit or cv.ncvreg object
- y_nons_pred
predicted values for the non-probability sample
- y_rand_pred
predicted values for the probability sample or population totals
- coefficients
coefficients for the model (if available)
- svydesign
an updated survey.design2 object (a new column y_hat_MI is added)
- y_mi_hat
estimated population mean for the target variable
- vars_selection
whether variable selection was performed
- var_prob
variance for the probability sample component (if available)
- var_nonprob
variance for the non-probability sample component
- var_total
total variance; if possible it is var_prob + var_nonprob, otherwise just a scalar
- model
model type (character "glm")
- family
family type (character "glm")
Details
Analytical variance
The variance of the mean is estimated using the following approach:
(a) non-probability part (\(S_A\) with size \(n_A\); denoted as var_nonprob in the result)
$$ \hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i^2 \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace^2, $$
where \(\hat{e}_i = y_i - m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})\), \(\dot{\boldsymbol{m}}(\boldsymbol{x}_i; \boldsymbol{\beta}) = \partial m(\boldsymbol{x}_i; \boldsymbol{\beta}) / \partial \boldsymbol{\beta}\), and $$\widehat{\boldsymbol{c}}=\left\lbrace n_A^{-1} \sum_{i \in A} \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right) \boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)^{\prime}\right\rbrace^{-1} N^{-1} \sum_{i \in B} w_i \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right).$$
Under the linear regression model \(\boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i\) and \(\widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .\)
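The linear regression case of part (a) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the simulated matrices and weights are hypothetical, and the estimator is taken in the form with squared residuals and squared \(\boldsymbol{x}_i^\prime\hat{\boldsymbol{c}}\) terms, following Kim et al. (2021).

```r
set.seed(1)

# Hypothetical inputs (illustration only)
N   <- 5000                                    # population size
n_A <- 300                                     # non-probability sample size
n_B <- 200                                     # probability sample size
X_nons <- cbind(1, rnorm(n_A, mean = 0.5))     # design matrix for S_A
y_nons <- drop(X_nons %*% c(1, 2)) + rnorm(n_A)
X_rand <- cbind(1, rnorm(n_B))                 # design matrix for S_B
w_B    <- rep(N / n_B, n_B)                    # design weights for S_B

# Least squares fit on S_A and residuals e_i = y_i - x_i' beta_hat
beta_hat <- solve(crossprod(X_nons), crossprod(X_nons, y_nons))
e_hat    <- drop(y_nons - X_nons %*% beta_hat)

# c_hat = (n_A^{-1} sum_{i in A} x_i x_i')^{-1} N^{-1} sum_{i in B} w_i x_i
c_hat <- solve(crossprod(X_nons) / n_A, colSums(w_B * X_rand) / N)

# V_1 = n_A^{-2} sum_{i in A} e_i^2 (x_i' c_hat)^2
var_nonprob <- sum(e_hat^2 * drop(X_nons %*% c_hat)^2) / n_A^2
```

Here w_B * X_rand multiplies each row of X_rand by its weight, so colSums() returns the weighted column totals \(\sum_{i \in B} w_i \boldsymbol{x}_i\).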
(b) probability part (\(S_B\) with size \(n_B\); denoted as var_prob in the result)
This part uses the functionality of the {survey} package, and the variance is estimated using the following equation:
$$ \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}. $$
Note that \(\hat{V}_2\) in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Furthermore, if only population totals/means are known and assumed to be fixed we set \(\hat{V}_2=0\).
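To make the double sum in part (b) concrete, the sketch below evaluates it directly for one simple design: simple random sampling without replacement, where \(\pi_i = n_B/N\) and \(\pi_{ij} = n_B(n_B-1)/\{N(N-1)\}\) for \(i \neq j\) (with \(\pi_{ii} = \pi_i\)), and the second factor is evaluated at unit \(j\). The predicted values are simulated stand-ins and all object names are assumptions; in practice the {survey} package handles this step for the supplied design.

```r
set.seed(42)

# Hypothetical inputs (illustration only)
N   <- 1000                               # population size
n_B <- 50                                 # probability sample size
m_hat <- rnorm(n_B, mean = 10, sd = 2)    # stand-in for m(x_i; beta_hat)

# Inclusion probabilities under simple random sampling without replacement
pi_i  <- rep(n_B / N, n_B)
pi_ij <- matrix(n_B * (n_B - 1) / (N * (N - 1)), n_B, n_B)
diag(pi_ij) <- pi_i                       # pi_ii = pi_i

# Double sum written as a quadratic form z' Delta z / N^2
z     <- m_hat / pi_i
delta <- (pi_ij - tcrossprod(pi_i)) / pi_ij
var_prob <- drop(t(z) %*% delta %*% z) / N^2

# Closed form for the same design: (1 - n_B / N) * s^2 / n_B
var_srs <- (1 - n_B / N) * var(m_hat) / n_B
```

For this design the double sum and the textbook closed form agree, which is a convenient check on the indexing of \(\pi_{ij}\), \(\pi_i\) and \(\pi_j\).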
References
Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 941-963.
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
res_glm <- method_glm(y_nons = admin$single_shift,
                      X_nons = model.matrix(~ region + private + nace + size, admin),
                      X_rand = model.matrix(~ region + private + nace + size, jvs),
                      svydesign = jvs_svy)
res_glm
#> Mass imputation model (GLM approach). Estimated mean: 0.7039 (se: 0.0115)