The goal of the nonprobsvy R package is to perform statistical inference on non-probability survey samples, including big data, when auxiliary information from external sources, such as probability samples or population totals and means, is available. Several other packages allow for correcting selection bias in non-probability samples, such as GJRM (Marra and Radice 2023), NonProbEst (Castro Martín, Ferri García, and Rueda 2020), or even sampling (Tillé and Matei 2021). However, these packages do not implement state-of-the-art approaches recently proposed in the literature (e.g., Chen, Li, and Wu 2020; Yang, Kim, and Song 2020; Wu 2022), nor do they use the survey package (Lumley 2004) for inference.
We implemented propensity score weighting (e.g., with calibration constraints), mass imputation (e.g., nearest neighbor), and doubly robust estimators. These take into account minimization of the asymptotic bias of the population mean estimators, variable selection, and the overlap between the probability and non-probability samples. The package uses functionality from the survey package when a probability sample is available.
Probability sampling methods are adopted by official statistics and researchers in many areas, such as the social and health sciences. Probability samples are collected under a known sampling design and are therefore highly representative of the target population. However, this type of sampling is expensive and subject to high non-response rates, which increase every year (source here). With advances in technology and the increasing availability of big data sources, non-probability samples are becoming increasingly popular in statistical inference. They contain rich information about the target population and offer cost and time efficiency compared to probability samples. On the other hand, the selection mechanism for non-probability samples is unknown, so it is misleading to claim that these samples are representative of the target population.
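The representativeness problem can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population model, the logistic selection mechanism, and all numeric values are hypothetical, and it is written in plain Python rather than with the nonprobsvy API. It compares a simple random sample, a non-probability sample whose inclusion depends on an auxiliary variable, and a weighted estimate that uses the true inclusion propensities (known here only because the data are simulated).

```python
import math
import random

random.seed(42)

# Hypothetical finite population: the outcome y depends on an auxiliary variable x.
N = 50_000
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]
true_mean = sum(y) / N

# Probability sample: simple random sampling without replacement.
n_b = 1_000
s_b = random.sample(range(N), n_b)
mean_prob = sum(y[i] for i in s_b) / n_b


def propensity(xi):
    """Assumed (hypothetical) inclusion probability; it increases with x."""
    return 1.0 / (1.0 + math.exp(-(-3.0 + xi)))


# Non-probability sample: each unit self-selects with probability propensity(x).
s_a = [i for i in range(N) if random.random() < propensity(x[i])]
mean_naive = sum(y[i] for i in s_a) / len(s_a)

# Inverse probability weighting with the TRUE propensities (an oracle that is
# unavailable in practice, where propensities must be estimated).
weights = [1.0 / propensity(x[i]) for i in s_a]
mean_ipw = sum(w * y[i] for w, i in zip(weights, s_a)) / sum(weights)

print(f"population mean:          {true_mean:.3f}")
print(f"probability sample mean:  {mean_prob:.3f}")
print(f"naive non-prob. mean:     {mean_naive:.3f}")
print(f"oracle IPW non-prob mean: {mean_ipw:.3f}")
```

Under these assumptions the naive mean of the non-probability sample is pulled upward, because units with large values of x are over-represented, while the weighted estimate moves back toward the population mean. In real applications the propensities are unknown and have to be estimated using auxiliary information.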
A popular framework in data integration assumes that auxiliary variables for the same population are available from probability samples or as population totals provided by external sources, so that this information can be combined with biased non-probability samples. In this book, we implement and expand upon established methods for statistical inference using the data integration framework described above. We can treat data integration as a missing data problem with the following structure:
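| Sample | Type | Auxiliary variables | Target variable | Representative |
|---|---|---|---|---|
| \(S_B\) | Probability Sample | \(\checkmark\) | | Yes |
| \(S_A\) | Non-probability Sample | \(\checkmark\) | \(\checkmark\) | No |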
Instead of sample \(S_B\), we can consider a vector of population totals or means, which will be discussed later in the book. Based on the setting from the table above, we are going to develop statistical inference methods that require certain assumptions about the outcome or selection models, as well as the sampling mechanism for \(S_B\).
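As a rough preview, the target quantity and the three estimator families mentioned earlier (propensity score weighting, mass imputation, and doubly robust estimation) can be written schematically as below. The notation is an assumption introduced here for illustration and may differ from that used in later chapters: \(U\) is the finite population of size \(N\), \(\hat{\pi}_i^A\) is an estimated propensity of inclusion in \(S_A\), \(d_i^B\) is the design weight of unit \(i\) in \(S_B\), and \(\hat{m}_i\) is a prediction of \(y_i\) from an outcome model fitted on \(S_A\). The forms shown are the commonly used normalized versions found in the literature cited above.

\[
\mu_y = \frac{1}{N} \sum_{i \in U} y_i
\]

\[
\hat{\mu}_{IPW} = \frac{1}{\hat{N}^A} \sum_{i \in S_A} \frac{y_i}{\hat{\pi}_i^A},
\qquad
\hat{\mu}_{MI} = \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \hat{m}_i,
\]

\[
\hat{\mu}_{DR} = \frac{1}{\hat{N}^A} \sum_{i \in S_A} \frac{y_i - \hat{m}_i}{\hat{\pi}_i^A} + \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \hat{m}_i,
\qquad
\hat{N}^A = \sum_{i \in S_A} \frac{1}{\hat{\pi}_i^A},
\quad
\hat{N}^B = \sum_{i \in S_B} d_i^B .
\]

The doubly robust form combines the other two: it remains consistent if either the selection model behind \(\hat{\pi}_i^A\) or the outcome model behind \(\hat{m}_i\) is correctly specified.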
The structure of the book is as follows. In Chapters 2, 3 and 4, we describe recommended methods for addressing the issues presented above. In Chapter 5, we propose variable selection techniques for high-dimensional data, which can help reduce estimator bias. In the final chapters, we present simulation results obtained using the package under development, provide a summary, and discuss methods that warrant further research.
Castro Martín, Luis, Ramón Ferri García, and María del Mar Rueda. 2020. NonProbEst: Estimation in Nonprobability Sampling.
Chen, Yilin, Pengfei Li, and Changbao Wu. 2020. “Doubly Robust Inference with Nonprobability Survey Samples.” Journal of the American Statistical Association 115 (532): 2011–21.
Lumley, Thomas. 2004. “Survey R Package.”
Marra, Giampiero, and Rosalba Radice. 2023. GJRM: Generalized Joint Regression Modelling.
Tillé, Yves, and Alina Matei. 2021. Sampling: Survey Sampling. https://CRAN.R-project.org/package=sampling.
Wu, Changbao. 2022. “Statistical Inference with Non-Probability Survey Samples.” Survey Methodology 48: 283–311.
Yang, Shu, Jae Kwang Kim, and Rui Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (2): 445–65. https://doi.org/10.1111/rssb.12354.