
Inference with non-probability survey samples
Source:R/nonprob.R
, R/nonprob_documentation.R
nonprob.Rd
nonprob
function provides an access to the various methods for inference based on non-probability surveys (including big data). The function allows to estimate the population mean based on the access to a reference probability sample (via the survey
package), as well as totals or means of covariates.
The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020),
Yang et al. (2020), Wu (2022) and uses the Lumley 2004 survey
package for inference (if a reference probability sample is provided).
It provides various propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour, predictive mean matching) and doubly robust estimators (e.g. that take into account minimisation of the asymptotic bias of the population mean estimators).
The package uses the survey
package functionality when a probability sample is available.
All optional parameters are set to NULL
. The obligatory ones include data
as well as one of the following three:
selection
, outcome
, or target
– depending on which method has been selected.
In the case of outcome
and target
multiple y variables can be specified.
Usage
nonprob(
data,
selection = NULL,
outcome = NULL,
target = NULL,
svydesign = NULL,
pop_totals = NULL,
pop_means = NULL,
pop_size = NULL,
method_selection = c("logit", "cloglog", "probit"),
method_outcome = c("glm", "nn", "pmm", "npar"),
family_outcome = c("gaussian", "binomial", "poisson"),
subset = NULL,
strata = NULL,
weights = NULL,
na_action = NULL,
control_selection = control_sel(),
control_outcome = control_out(),
control_inference = control_inf(),
start_selection = NULL,
start_outcome = NULL,
verbose = FALSE,
se = TRUE,
...
)
Arguments
- data
a
data.frame
with dataset containing the non-probability sample.- selection
a
formula
(defaultNULL
) for the selection (propensity) score model.- outcome
a
formula
(defaultNULL
) for the outcome (target) model.- target
a
formula
(defaultNULL
) with target variable(s). We allow multiple target variables (e.g.~y1 + y2 + y3
).- svydesign
an optional
svydesign2
class object containing a probability sample and design weights.- pop_totals
an optional
named vector
with population totals of the covariates.- pop_means
an optional
named vector
with population means of the covariates.- pop_size
an optional
double
value with population size.- method_selection
a
character
(defaultlogit
) indicating the method for the propensity score link function.- method_outcome
a
character
(defaultglm
) indicating the method for the outcome model.- family_outcome
a
character
(defaultgaussian
) describing the error distribution and the link function to be used in the model. Currently supports:gaussian
with the identity link,poisson
andbinomial
.- subset
an optional
vector
specifying a subset of observations to be used in the fitting process - not yet supported.- strata
an optional
vector
specifying strata (not yet supported, for further development).- weights
an optional
vector
of prior weights to be used in the fitting process. It is assumed that this vector contains frequency or analytic weights (i.e. rows of thedata
argument are repeated according to the values of theweights
argument), not probability/design weights.- na_action
a function which indicates what should happen when the data contain
NAs
(not yet supported, for further development).- control_selection
a
list
(defaultcontrol_sel()
result) indicating parameters to be used when fitting the selection model for propensity scores. To change the parameters one should use thecontrol_sel()
function.- control_outcome
a
list
(defaultcontrol_out()
result) indicating parameters to be used when fitting the model for the outcome variable. To change the parameters one should use thecontrol_out()
function.- control_inference
a
list
(defaultcontrol_inf()
result) indicating parameters to be used for inference based on probability and non-probability samples. To change the parameters one should use thecontrol_inf()
function.- start_selection
an optional
vector
with starting values for the parameters of the selection equation.- start_outcome
an optional
vector
with starting values for the parameters of the outcome equation.- verbose
a numerical value (default
TRUE
) whether detailed information on the fitting should be presented.- se
Logical value (default
TRUE
) indicating whether to calculate and return standard error of estimated mean.- ...
Additional, optional arguments.
Value
Returns an object of class c("nonprobsvy", "nonprobsvy_ipw")
in case of inverse probability weighting estimator, c("nonprobsvy", "nonprobsvy_mi")
in case of mass imputation estimator, or c("nonprobsvy", "nonprobsvy_dr")
in case of doubly robust estimator,
of type list
containing:
X
– amodel.matrix
containing data from probability and non-probability samples if specified at a function call.y
– alist
of vector of outcome variables if specified at a function call.R
– anumeric vector
indicating whether a unit belongs to the probability (0) or non-probability (1) units in the matrix X.prob
– anumeric vector
of estimated propensity scores for non-probability sample.weights
– avector
of estimated weights for non-probability sample.control
– alist
of control functions.output
– an output of the model with information on the estimated population mean and standard errors.SE
– adata.frame
with standard error of the estimator of the population mean, divided into errors from probability and non-probability samples.confidence_interval
– adata.frame
with confidence interval of population mean estimator.nonprob_size
– a scalarnumeric vector
denoting the size of non-probability sample.prob_size
– a scalarnumeric vector
denoting the size of probability sample.pop_size
– a scalarnumeric vector
estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample).pop_totals
– anumeric vector
with the total values of the auxiliary variables derived from a probability sample or anumeric vector
of the total/mean values.estimator
– acharacter vector
with information what type of estimator was selected (one ofc("ipw", "mi", "dr")
).outcome
– alist
containing information about the fitting of the mass imputation model, in the case of regression model the object containing the list returned bystats::glm()
, in the case of the nearest neighbour imputation the object containing list returned byRANN::nn2()
. Ifbias_correction
incontrol_inf()
is set toTRUE
, the estimation is based on the joint estimating equations for theselection
andoutcome
model and therefore, the list is different from the one returned by thestats::glm()
function and contains elements such ascoefficients
– anumeric vector
with estimated coefficients of the regression model.std_err
– anumeric vector
with standard errors of the estimated coefficients.residuals
– anumeric vector
with the response residuals.variance_covariance
– amatrix
with the variance-covariance matrix of the coefficient estimates.df_residual
– a scalarvector
with the degrees of freedom for residuals.family
– acharacter
that specifies the error distribution and link function to be used in the model.fitted.values
– anumeric vector
with the predicted values of the response variable based on the fitted model.linear.predictors
– anumeric vector
with the linear fit on link scale.X
– amatrix
with the design matrix (model.matrix
)method
– set onglm
, since the regression method.model_frame
– amodel.matrix
of data from probability sample used for mass imputation.
cve
– the error for each value oflambda
, averaged across the cross-validation folds.
selection
– alist
containing information about fitting of propensity score model, such ascoefficients
– anumeric vector
of coefficients.std_err
– anumeric vector
with standard errors of the estimated model coefficients.residuals
– anumeric vector
with the response residuals.variance
– a scalarnumeric vector
the root mean square error.fitted_values
– anumeric vector
with the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.link
– thelink
object used.linear_predictors
– anumeric vector
with the linear fit on link scale.aic
– A version of Akaike's An Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.weights
– anumeric vector
with estimated weights for non-probability sample.prior.weights
– anumeric vector
with the frequency weights initially supplied, a vector of 1s if none were.est_totals
– anumeric vector
with the estimated total values of auxiliary variables derived from a non-probability sample.formula
– the formula supplied.df_residual
– the residual degrees of freedom.log_likelihood
– value of log-likelihood function ifmle
method, in the other caseNA
.cve
– the error for each value of thelambda
, averaged across the cross-validation folds for the variable selection model when the propensity score model is fitting. Returned only if selection of variables for the model is used.method_selection
– Link function, e.g.logit
,cloglog
orprobit
.hessian
– Hessian Gradient of the log-likelihood function frommle
method.gradient
– Gradient of the log-likelihood function frommle
method.method
– An estimation method for selection model, e.g.mle
orgee
.prob_der
– Derivative of the inclusion probability function for units in a non–probability sample.prob_rand
– Inclusion probabilities for unit from a probability sample from thesvydesign
object.prob_rand_est
– Inclusion probabilities to a non-probability sample for unit from probability sample.prob_rand_est_der
– Derivative of the inclusion probabilities to a non–probability sample for unit from probability sample.
stat
– amatrix
of the estimated population means in each bootstrap iteration. Returned only if a bootstrap method is used to estimate the variance andkeep_boot
incontrol_inf()
is set onTRUE
.
Details
Let y be the response variable for which we want to estimate the population mean,
given by _y = 1N _i=1^N y_i. For this purpose we consider data integration
with the following structure. Let S_A be the non-probability sample with the design matrix of covariates as
X_A =
bmatrix
x_11 & x_12 & & x_1p
x_21 & x_22 & & x_2p
& & &
x_n_A1 & x_n_A2 & & x_n_Ap
bmatrix
and vector of outcome variable
y =
bmatrix
y_1
y_2
y_n_A.
bmatrix
On the other hand, let S_B be the probability sample with design matrix of covariates be
X_B =
bmatrix
x_11 & x_12 & & x_1p
x_21 & x_22 & & x_2p
& & &
x_n_B1 & x_n_B2 & & x_n_Bp.
bmatrix
Instead of a sample of units we can consider a vector of population sums in the form of _x = (_i Ux_i1, _i Ux_i2, ..., _i Ux_ip) or means
_xN, where U refers to a finite population. Note that we do not assume access to the response variable for S_B.
In general we make the following assumptions:
The selection indicator of belonging to non-probability sample R_i and the response variable y_i are independent given the set of covariates x_i.
All units have a non-zero propensity score, i.e., _i^A > 0 for all i.
The indicator variables R_i^A and R_j^A are independent for given x_i and x_j for i j.
There are three possible approaches to the problem of estimating population mean using non-probability samples:
Inverse probability weighting – The main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. This is why we talk about the so-called "biased sample" problem. The inverse probability approach is based on the assumption that a reference probability sample is available and therefore we can estimate the propensity score of the selection mechanism. The estimator has the following form: _IPW = 1N^A_i S_A y_i_i^A. For this purpose several estimation methods can be considered. The first approach is maximum likelihood estimation with a corrected log-likelihood function, which is given by the following formula ^*() = _i S_A (x_i, )1 - (x_i,) + _i S_Bd_i^B 1 - (x_i,). In the literature, the main approach to modelling propensity scores is based on the logit link function. However, we extend the propensity score model with the additional link functions such as cloglog and probit. The pseudo-score equations derived from ML methods can be replaced by the idea of generalised estimating equations with calibration constraints defined by equations. U()=_i S_A h(x_i, )-_i S_B d_i^B (x_i, ) h(x_i, ). Notice that for h(x_i, ) = (x, )x We do not need a probability sample and can use a vector of population totals/means.
Mass imputation – This method is based on a framework where imputed values of outcome variables are created for the entire probability sample. In this case, we treat the large sample as a training data set that is used to build an imputation model. Using the imputed values for the probability sample and the (known) design weights, we can build a population mean estimator of the form: _MI = 1N^B_i S_B d_i^B y_i. It opens the door to a very flexible method for imputation models. The package uses generalized linear models from
stats::glm()
, the nearest neighbour algorithm usingRANN::nn2()
and predictive mean matching.Doubly robust estimation – The IPW and MI estimators are sensitive to misspecified models for the propensity score and outcome variable, respectively. To this end, so-called doubly robust methods are presented that take these problems into account. It is a simple idea to combine propensity score and imputation models during inference, leading to the following estimator _DR = 1N^A_i S_A d_i^A (y_i - y_i) + 1N^B_i S_B d_i^B y_i. In addition, an approach based directly on bias minimisation has been implemented. The following formula aligned bias(_DR) = & E (_DR - )
= & E 1N _i=1^N (R_i^A_i^A (x_i^T ) - 1 ) (y_i - m(x_i^T ))
+ & E 1N _i=1^N (R_i^B d_i^B - 1) m( x_i^T ) , aligned lead us to system of equations aligned J(, ) = arrayc J_1(, )
J_2(, ) array = arrayc _i=1^N R_i^A\ 1(x_i, )-1 y_i-m(x_i, ) x_i
_i=1^N R_i^A(x_i, ) m(x_i, ) - _i S_B d_i^B m(x_i, ) array , aligned where m(x_i, ) is a mass imputation (regression) model for the outcome variable and propensity scores _i^A are estimated using alogit
function for the model. As with theMLE
andGEE
approaches we have extended this method tocloglog
andprobit
links.
As it is not straightforward to calculate the variances of these estimators, asymptotic equivalents of the variances derived using the Taylor approximation have been proposed in the literature. Details can be found here. In addition, the bootstrap approach can be used for variance estimation.
The function also allows variables selection using known methods that have been implemented to handle the integration of probability and non-probability sampling.
In the presence of high-dimensional data, variable selection is important, because it can reduce the variability in the estimate that results from using irrelevant variables to build the model.
Let U( , ) be the joint estimating function for ( , ). We define the
penalized estimating functions as
U^p (, ) = U(, ) - arrayc
q__(||) sgn() \
q__(|\boldsymbol|) sgn()
array ,
where _ and q__ are some smooth functions. We let q_ (x) = p_ x, where p_ is some penalization function.
Details of penalization functions and techniques for solving this type of equation can be found here.
To use the variable selection model, set the vars_selection
parameter in the control_inf()
function to TRUE
. In addition, in the other control functions such as control_sel()
and control_out()
you can set parameters for the selection of the relevant variables, such as the number of folds during cross-validation algorithm or the lambda value for penalizations. Details can be found
in the documentation of the control functions for nonprob
.
References
Kim JK, Park S, Chen Y, Wu C. Combining non-probability and probability survey samples through mass imputation. J R Stat Soc Series A. 2021;184:941– 963.
Shu Yang, Jae Kwang Kim, Rui Song. Doubly robust inference when combining probability and non-probability samples with high dimensional data. J. R. Statist. Soc. B (2020)
Yilin Chen , Pengfei Li & Changbao Wu (2020) Doubly Robust Inference With Nonprobability Survey Samples, Journal of the American Statistical Association, 115:532, 2011-2021
Shu Yang, Jae Kwang Kim and Youngdeok Hwang Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58
See also
stats::optim()
– For more information on the optim
function used in the
optim
method of propensity score model fitting.
maxLik::maxLik()
– For more information on the maxLik
function used in
maxLik
method of propensity score model fitting.
ncvreg::cv.ncvreg()
– For more information on the cv.ncvreg
function used in
variable selection for the outcome model.
nleqslv::nleqslv()
– For more information on the nleqslv
function used in
estimation process of the bias minimization approach.
stats::glm()
– For more information about the generalised linear models used during mass imputation process.
RANN::nn2()
– For more information about the nearest neighbour algorithm used during mass imputation process.
control_sel()
– For the control parameters related to selection model.
control_out()
– For the control parameters related to outcome model.
control_inf()
– For the control parameters related to statistical inference.
Examples
# \donttest{
# generate data based on Doubly Robust Inference With Non-probability Survey Samples (2021)
# Yilin Chen , Pengfei Li & Changbao Wu
library(sampling)
#>
#> Attaching package: ‘sampling’
#> The following objects are masked from ‘package:survival’:
#>
#> cluster, strata
set.seed(123)
# sizes of population and probability sample
N <- 20000 # population
n_b <- 1000 # probability
# data
z1 <- rbinom(N, 1, 0.7)
z2 <- runif(N, 0, 2)
z3 <- rexp(N, 1)
z4 <- rchisq(N, 4)
# covariates
x1 <- z1
x2 <- z2 + 0.3 * z2
x3 <- z3 + 0.2 * (z1 + z2)
x4 <- z4 + 0.1 * (z1 + z2 + z3)
epsilon <- rnorm(N)
sigma_30 <- 10.4
sigma_50 <- 5.2
sigma_80 <- 2.4
# response variables
y30 <- 2 + x1 + x2 + x3 + x4 + sigma_30 * epsilon
y50 <- 2 + x1 + x2 + x3 + x4 + sigma_50 * epsilon
y80 <- 2 + x1 + x2 + x3 + x4 + sigma_80 * epsilon
# population
sim_data <- data.frame(y30, y50, y80, x1, x2, x3, x4)
## propensity score model for non-probability sample (sum to 1000)
eta <- -4.461 + 0.1 * x1 + 0.2 * x2 + 0.1 * x3 + 0.2 * x4
rho <- plogis(eta)
# inclusion probabilities for probability sample
z_prob <- x3 + 0.2051
sim_data$p_prob <- inclusionprobabilities(z_prob, n = n_b)
# data
sim_data$flag_nonprob <- UPpoisson(rho) ## sampling nonprob
sim_data$flag_prob <- UPpoisson(sim_data$p_prob) ## sampling prob
nonprob_df <- subset(sim_data, flag_nonprob == 1) ## non-probability sample
svyprob <- svydesign(
ids = ~1, probs = ~p_prob,
data = subset(sim_data, flag_prob == 1),
pps = "brewer"
) ## probability sample
## mass imputation estimator
MI_res <- nonprob(
outcome = y80 ~ x1 + x2 + x3 + x4,
data = nonprob_df,
svydesign = svyprob
)
MI_res
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimator: 11.9797
#> - selected estimator: 9.5183 (se=0.1510, ci=(9.2223, 9.8143))
## inverse probability weighted estimator
IPW_res <- nonprob(
selection = ~ x1 + x2 + x3 + x4,
target = ~y80,
data = nonprob_df,
svydesign = svyprob
)
IPW_res
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimator: 11.9797
#> - selected estimator: 9.4911 (se=0.5555, ci=(8.4023, 10.5798))
## doubly robust estimator
DR_res <- nonprob(
outcome = y80 ~ x1 + x2 + x3 + x4,
selection = ~ x1 + x2 + x3 + x4,
data = nonprob_df,
svydesign = svyprob
)
DR_res
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimator: 11.9797
#> - selected estimator: 9.4835 (se=0.1587, ci=(9.1725, 9.7945))
# }