1  Introduction and Overview

The goal of the nonprobsvy R package is to perform statistical inference on non-probability survey samples, including big data, when auxiliary information from external sources such as probability samples or population totals and means is available. Several packages already allow for correcting selection bias in non-probability samples, such as GJRM (Marra and Radice 2023), NonProbEst (Luis Castro Martín and Mar Rueda 2020), or even sampling (Tillé and Matei 2021). However, these packages do not implement state-of-the-art approaches recently proposed in the literature (e.g., Chen, Li, and Wu 2020; Yang, Kim, and Song 2020; Wu 2022), nor do they use the survey package (Lumley 2004) for inference.

We implemented propensity score weighting (e.g., with calibration constraints), mass imputation (e.g., nearest neighbor), and doubly robust estimators. The implementation covers minimizing the asymptotic bias of the population mean estimators, variable selection, and the overlap between random and nonrandom samples. The package uses functionality from the survey package when a probability sample is available.

Probability sampling methods are adopted by official statistics and by researchers in many areas, such as social or health science. Probability samples are collected under a known sampling design and are therefore highly representative of the target population. However, this type of sampling is expensive and subject to high non-response rates, which increase every year (source here).

With advances in technology and the increasing availability of big data sources, non-probability samples are becoming increasingly popular in statistical inference. They contain rich information about the target population and offer cost and time efficiency compared to probability samples. On the other hand, the selection mechanism for non-probability samples is unknown, so it is misleading to claim that these samples are representative of the target population.
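To make the consequence of an unknown selection mechanism concrete, the following is a minimal simulation sketch. It is written in plain Python and does not use nonprobsvy. Everything in it is an assumption made for illustration: the synthetic population, the logistic selection model, and the function name `simulate` are hypothetical. The inclusion propensities are also treated as known, which is never the case for a real non-probability sample; in practice they have to be estimated using auxiliary information.

```python
import math
import random


def simulate(n_pop=20000, seed=1):
    """Compare a naive mean with a propensity-weighted mean on synthetic data.

    Hypothetical setup: x ~ N(0, 1), y = 1 + 2x + noise, and a unit enters
    the non-probability sample with probability expit(-2 + x), so units with
    large x (and hence large y) are over-represented.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_pop):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        # Assumed (in reality unknown) propensity to join the sample.
        pi = 1.0 / (1.0 + math.exp(-(-2.0 + x)))
        in_sample = rng.random() < pi
        population.append((y, pi, in_sample))

    # Population mean of y: the quantity we want to estimate.
    true_mean = sum(y for y, _, _ in population) / n_pop

    # Observed non-probability sample.
    sample = [(y, pi) for y, pi, s in population if s]

    # Naive estimator: unweighted sample mean, ignores selection.
    naive = sum(y for y, _ in sample) / len(sample)

    # Inverse propensity weighted (Hajek-type) estimator with known propensities.
    ipw = sum(y / pi for y, pi in sample) / sum(1.0 / pi for _, pi in sample)

    return true_mean, naive, ipw


if __name__ == "__main__":
    true_mean, naive, ipw = simulate()
    print(f"population mean: {true_mean:.3f}")
    print(f"naive mean:      {naive:.3f}")
    print(f"weighted mean:   {ipw:.3f}")
```

Under these assumptions the naive mean overstates the population mean because selection depends on `x`, while the weighted mean lands much closer to it. The estimators discussed in this book address the realistic case in which the propensities are not available and must be modeled.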

A popular framework in data integration assumes that auxiliary variables for the same population are available either from probability samples or as population totals provided by external sources, so that this information can be combined with biased non-probability samples. In this book, we implement and expand upon established methods for statistical inference within this data integration framework. We can treat data integration as a missing data problem with the following structure:

| Sample | Type | \(\bX\) | \(Y\) | Representative? |
|---|---|---|---|---|
| \(S_B\) | Probability Sample | \(\checkmark\) | | Yes |
| \(S_A\) | Non-probability Sample | \(\checkmark\) | \(\checkmark\) | No |

Instead of sample \(S_B\), we can consider a vector of population totals or means, which will be discussed later in the book. Based on the setting in the table above, we develop statistical inference methods that require certain assumptions about the outcome or selection models, as well as about the sampling mechanism for \(S_B\).
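The setting above can be written down more formally. The notation in this sketch is an assumption borrowed from the cited literature (e.g., Chen, Li, and Wu 2020) and may differ from the notation used in later chapters: \(U\) denotes the target population of size \(N\), \(\boldsymbol{x}_i\) and \(y_i\) are the auxiliary variables and the study variable for unit \(i\), \(d_i^B\) is the design weight of unit \(i\) in \(S_B\), and \(R_i^A\) indicates inclusion in \(S_A\).

\[
\begin{aligned}
\text{Target:}\quad & \mu_y = \frac{1}{N}\sum_{i \in U} y_i, \\
\text{Non-probability sample:}\quad & \{(\boldsymbol{x}_i, y_i) : i \in S_A\}, \qquad R_i^A = I(i \in S_A), \\
\text{Probability sample:}\quad & \{(\boldsymbol{x}_i, d_i^B) : i \in S_B\}, \qquad d_i^B = 1/\pi_i^B, \\
\text{Unknown propensity:}\quad & \pi_i^A = P(R_i^A = 1 \mid \boldsymbol{x}_i, y_i).
\end{aligned}
\]

In this notation, \(y_i\) is observed only for \(i \in S_A\), the inclusion probabilities \(\pi_i^B\) are known by design, and \(\pi_i^A\) is unknown. This is why assumptions about the selection model, the outcome model, or both are needed.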

The structure of the book is as follows. In Chapters 2, 3 and 4, we describe recommended methods for addressing the presented issues. In Chapter 5, we propose variable selection techniques for high-dimensional data, which can help reduce estimator bias. In the final chapters, we present simulation results obtained using the package under development, provide a summary, and discuss methods that warrant further research.