--- title: "The rescaled bootstrap" author: - "Dennis M. Feehan" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{The rescaled bootstrap} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction This vignette demonstrates the basic functionality of the `surveybootstrap` package. The goal is to illustrate how to use the **rescaled bootstrap** (Rao and Wu 1988; Rust and Rao 1996) to take bootstrap resamples from a survey dataset when the survey sample has a complex design. The **rescaled bootstrap** handles complex designs by resampling at the level of the primary sampling unit (PSU) and adjusting the sampling weights of every observation in each bootstrap replicate. Specifically, for each bootstrap resample, we draw $m_i = n_i - 1$ PSUs with replacement from the $n_i$ PSUs in stratum $i$, and rescales the observation weights to reflect how many times each PSU was selected. This package implements the rescaled bootstrap in C++ (via Rcpp) for speed. Use `rescaled.bootstrap.sample` / `bootstrap.estimates` when your survey has a cluster or stratified design. For a simple random sample without clusters or strata, use `srs.bootstrap.sample` instead. --- ## The MU284 example data The `MU284` dataset (Sarndal, Swensson, and Wretman 2003, pg. 652) describes 284 Swedish municipalities. Relevant columns include: | Column | Description | |---------|-------------| | `LABEL` | Municipality ID | | `S82` | Total seats in municipal council | | `RMT85` | Municipal tax revenue, 1985 (in millions of Kronor) | | `P85` | Population, 1985 (in thousands) | | `CL` | Cluster indicator (a set of neighboring municipalities) | `MU284.complex.surveys` is a list of 10 sample surveys drawn from this population using the two-stage design from Example 4.3.2 of Sarndal et al.: - **Stage I** — simple random sample without replacement of $n_I = 5$ PSUs (clusters) from $N_I = 50$ total PSUs - **Stage II** — simple random sample without replacement of $n_i = 3$ municipalities from each selected PSU This gives $n = 15$ sampled municipalities per survey. Each row has a `sample_weight` reflecting the inverse of its selection probability. ```{r show-data} library(surveybootstrap) survey <- MU284.complex.surveys[[1]] head(survey[, c("LABEL", "CL", "S82", "RMT85", "P85", "sample_weight")]) ``` --- There are two ways to use the package: * [Path 1: specifying a function to produce and estimate from data with bootstrap weights](#Path-1) * [Path 2: getting a matrix of bootstrap weights](#Path-2) --- ## Path 1: `bootstrap.estimates()` `bootstrap.estimates()` is the highest-level interface. You supply an estimator function and a bootstrap method; it returns bootstrap replicate estimates that you can use to compute standard errors and confidence intervals. ### Step 1: define an estimator Your estimator must accept two arguments — `survey.data` (the resampled dataset) and `weights` (a vector of rescaled sampling weights) — and return a named data frame with one row. ```{r define-estimator} my_estimator <- function(survey.data, weights) { data.frame( total_S82 = sum(survey.data$S82 * weights), ratio_RMT85_P85 = sum(survey.data$RMT85 * weights) / sum(survey.data$P85 * weights) ) } ``` Here `total_S82` is the Horvitz-Thompson estimator of the population total of `S82` (total council seats), and `ratio_RMT85_P85` is a ratio estimator (tax revenue per capita). ### Step 2: Run the bootstrap ```{r run-bootstrap} set.seed(12345) boot_res <- bootstrap.estimates( survey.data = survey, survey.design = ~ CL, # CL identifies the PSU (cluster) num.reps = 500, estimator.fn = my_estimator, # we just defined this above weights = "sample_weight", bootstrap.fn = "rescaled.bootstrap.sample" ) ``` The `survey.design` formula uses the column `CL` as the PSU identifier. There are no explicit strata here, so the entire sample is treated as a single stratum. See [Describing your survey design](#describing-your-survey-design) below for formulas with strata. `boot_res` is a list of 500 one-row data frames, one per bootstrap replicate. ### Step 3: Collect and summarize ```{r summarise} boot_df <- do.call("rbind", boot_res) # Bootstrap standard errors apply(boot_df, 2, sd) # Bootstrap means (should be close to the original point estimates) apply(boot_df, 2, mean) # Original point estimates for comparison my_estimator(survey, survey$sample_weight) ``` You can also pass `summary.fn` to `bootstrap.estimates()` to get a summary computed in a single pass without keeping all replicates in memory: ```{r summary-fn, eval = FALSE} boot_summary <- bootstrap.estimates( survey.data = survey, survey.design = ~ CL, num.reps = 500, estimator.fn = my_estimator, weights = "sample_weight", bootstrap.fn = "rescaled.bootstrap.sample", summary.fn = MU284.estimator.summary.fn, # example summary function verbose = FALSE ) ``` ### Visualising the bootstrap distribution ```{r boot-hist, fig.width = 6, fig.height = 4, fig.alt = "Histogram of 500 bootstrap estimates of the total number of municipal council seats across all Swedish municipalities. The distribution is roughly bell-shaped and centred near the original point estimate."} hist(boot_df$total_S82, main = "Bootstrap distribution: total council seats", xlab = "Estimated total (S82)", col = "steelblue", border = "white") abline(v = my_estimator(survey, survey$sample_weight)$total_S82, col = "firebrick", lwd = 2, lty = 2) legend("topright", legend = "Point estimate", col = "firebrick", lwd = 2, lty = 2, bty = "n") ``` --- ## Path 2: `get.rescaled.bootstrap.weights()` Sometimes it is more convenient to obtain a matrix of bootstrap weights directly and then apply your estimator(s) yourself. This is useful when: - you want to compute many different estimators from the same set of replicates - you need to pass bootstrap weights to external software - you are implementing jackknife-after-bootstrap (JAB) variance estimation ### Obtain bootstrap weights ```{r get-weights} boot_wts <- get.rescaled.bootstrap.weights( survey.data = survey, survey.design = ~ CL, idvar = "LABEL", weights = "sample_weight", num.reps = 500 ) ``` The result is a list with two elements: - **`orig_weights`** — a data frame with one row per respondent and a `weight` column containing the original sampling weight - **`boot_weights`** — a data frame with one row per respondent and columns `LABEL`, `boot_weight_1`, …, `boot_weight_500` ```{r inspect-weights} names(boot_wts) # First few rows and columns of boot_weights boot_wts$boot_weights[1:5, 1:4] ``` ### Compute estimates from bootstrap weights Join the survey data to the weight matrix, then apply your estimator for each replicate: ```{r manual-variance} # Merge survey data with bootstrap weights on the respondent id bw <- merge(survey[, c("LABEL", "S82", "RMT85", "P85")], boot_wts$boot_weights, by = "LABEL") # Compute the total_S82 estimator for each bootstrap replicate boot_totals <- sapply(paste0("boot_weight_", 1:500), function(col) { sum(bw$S82 * bw[[col]]) }) # Bootstrap standard error sd(boot_totals) ``` ### Scaling factors Pass `include_scaling_factors = TRUE` to also get the per-replicate weight scaling factors. Each bootstrap weight is the product of the original weight and the corresponding scaling factor: $$\text{boot\_weight}_{i,r} = \text{orig\_weight}_i \times \text{boot\_rep}_{i,r}$$ ```{r scaling-factors, eval = FALSE} boot_wts_sf <- get.rescaled.bootstrap.weights( survey.data = survey, survey.design = ~ CL, idvar = "LABEL", weights = "sample_weight", num.reps = 500, include_scaling_factors = TRUE ) # boot_wts_sf$weight_scaling_factor has columns LABEL, boot_rep_1, ..., boot_rep_500 ``` --- ## Describing your survey design The `survey.design` argument is a formula that tells `surveybootstrap` which columns identify PSUs and strata. | Formula | Meaning | |---------|---------| | `~ CL` | PSU is `CL`; no strata (one global stratum) | | `~ psu + strata(region)` | PSU is `psu`; strata defined by `region` | | `~ psu1 + psu2 + strata(s1 + s2)` | PSU uniquely identified by `(psu1, psu2)`; strata by `(s1, s2)` | | `~ 1` | Simple random sample — use `srs.bootstrap.sample` instead | For a simple random sample, use `bootstrap.fn = "srs.bootstrap.sample"` and `survey.design = ~ 1`: ```{r srs-example, eval = FALSE} srs_survey <- MU284.surveys[[1]] boot_srs <- bootstrap.estimates( survey.data = srs_survey, survey.design = ~ 1, num.reps = 500, estimator.fn = my_estimator, weights = "sample_weight", bootstrap.fn = "srs.bootstrap.sample" ) ``` --- ## Acknowledgement DF used Claude Opus 4.6 to help improve this vignette. ## References Rao, J.N.K. and Wu, C.F.J. (1988). Resampling inference with complex survey data. *Journal of the American Statistical Association*, 83(401), 231–241. Rao, J. N. K., C. F. J. Wu, and Kim Yue. "Some recent work on resampling methods for complex surveys." *Survey Methodology* 18.2 (1992): 209-217. Rust, K.F. and Rao, J.N.K. (1996). Variance estimation for complex surveys using replication techniques. *Statistical Methods in Medical Research*, 5(3), 283–310. Sarndal, C.-E., Swensson, B. and Wretman, J. (2003). *Model Assisted Survey Sampling*. Springer. ISBN: 0387406204.