2  Introduction and Overview

The goal of the nonprobsvy R package is to carry out statistical inference with nonprobability survey samples (including big data) when auxiliary information from external sources, such as probability samples or population totals/means, is available. Several packages already allow for correcting selection bias in nonprobability samples, such as GJRM (Marra et al. 2017), NonProbEst (Rueda et al. 2020) or even sampling (Tillé and Matei 2021). However, these packages neither implement state-of-the-art approaches recently proposed in the literature, such as Chen et al. (2020), Yang et al. (2020) and Wu (2022), nor use the survey package (Lumley 2004) for inference.

We implemented propensity score weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbor) and doubly robust estimators that take into account minimization of the asymptotic bias of the population mean estimators, variable selection or overlap between the random and nonrandom samples. The package uses survey package functionality when a probability sample is available.

Probability sampling methods are adopted by official statistics and by researchers in many areas, such as social or health science. Probability samples are collected under a known sampling design and are therefore highly representative of the target population. On the other hand, this type of sampling is known to be expensive and subject to high non-response rates, which increase every year (source here). With advances in technology and the growth of big data sources, nonprobability samples are becoming more and more popular in statistical inference. They contain rich information about the target population and are more cost- and time-efficient than probability samples. However, the selection mechanism is not known for nonprobability samples, and it is therefore a serious mistake to claim that such samples are representative of the target population.

A popular framework in data integration is to assume that auxiliary variables on the same population are available from probability samples or as population totals from external sources, so that this information can be combined with a biased nonprobability sample. In this book we apply and extend known methods for statistical inference within this data integration framework. We can treat data integration as a missing data problem with the following structure:

| Sample | Type | \(\bX\) | \(Y\) | Representative? |
|---|---|---|---|---|
| \(S_B\) | Probability sample | \(\checkmark\) | | Yes |
| \(S_A\) | Non-probability sample | \(\checkmark\) | \(\checkmark\) | No |
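To make the structure in the table concrete, the following is a minimal, self-contained Python sketch. It is not the nonprobsvy implementation. Everything in it is an assumption made for illustration: the simulated population, the linear outcome model, the logistic selection mechanism, the sample sizes, and names such as `s_a`, `s_b` and `nearest_donor_y`. It builds a probability sample \(S_B\) that carries only the auxiliary variable and design weights, and a nonprobability sample \(S_A\) that carries both the auxiliary variable and \(Y\). It then compares the naive mean of \(Y\) in \(S_A\) with a simple nearest-neighbor mass imputation estimate.

```python
import bisect
import math
import random

random.seed(2024)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: one auxiliary variable x and an outcome y.
N = 20_000
population = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = 1.0 + 2.0 * x + random.gauss(0.0, 1.0)  # assumed outcome model
    population.append((x, y))
true_mean = sum(y for _, y in population) / N

# S_B: simple random sample without replacement. Only x and the design
# weight d = N / n_B are kept; y is treated as unobserved.
n_B = 500
s_b = [(x, N / n_B) for x, _ in random.sample(population, n_B)]

# S_A: self-selected sample. The inclusion probability follows an assumed
# logistic model that increases with x; both x and y are observed.
# Sorting by x lets us search for nearest neighbors with bisect.
s_a = sorted(
    (x, y) for x, y in population
    if random.random() < 1.0 / (1.0 + math.exp(3.0 - x))
)

# Naive estimator: unweighted mean of y in S_A, ignoring the selection.
naive_mean = sum(y for _, y in s_a) / len(s_a)

donor_x = [x for x, _ in s_a]


def nearest_donor_y(x):
    """Return y of the S_A unit whose x is closest to the given x."""
    pos = bisect.bisect_left(donor_x, x)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(s_a)]
    best = min(candidates, key=lambda i: abs(donor_x[i] - x))
    return s_a[best][1]


# Mass imputation: impute y for every unit in S_B from its nearest
# neighbor in S_A, then take the design-weighted mean over S_B.
mi_mean = (
    sum(d * nearest_donor_y(x) for x, d in s_b) / sum(d for _, d in s_b)
)

print("population mean:       ", round(true_mean, 3))
print("naive mean from S_A:   ", round(naive_mean, 3))
print("mass imputation (S_B): ", round(mi_mean, 3))
```

Because units with larger values of the auxiliary variable are more likely to enter \(S_A\) in this setup, the naive mean overstates the population mean, whereas the imputed, design-weighted mean over \(S_B\) should land much closer to it. The sketch uses a single auxiliary variable and omits variance estimation.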

Instead of the sample \(S_B\), we can consider a vector of population totals/means, which is discussed later in the book. Under the setting from the table above, we develop methods of statistical inference that require certain assumptions about the outcome or selection models, as well as about the sampling mechanism for \(S_B\). The book is structured as follows. Chapters 2, 3 and 4 describe the recommended methods for this problem. Chapter 5 proposes variable selection techniques for high-dimensional data, which can reduce the bias of the estimator. The final chapters present simulation results obtained with the package under development, together with a summary and a discussion of methods that merit further work.