Title: | Bootstrap with Survey Data |
---|---|
Description: | Implements different kinds of bootstraps to estimate sampling variation from survey data with complex designs. Includes the rescaled bootstrap described in Rust and Rao (1996) <doi:10.1177/096228029600500305> and Rao and Wu (1988) <doi:10.1080/01621459.1988.10478591>. |
Authors: | Dennis M. Feehan [aut, cre], Matthew J. Salganik [ths] |
Maintainer: | Dennis M. Feehan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0.90000 |
Built: | 2025-02-19 04:29:22 UTC |
Source: | https://github.com/dfeehan/surveybootstrap |
Use a given bootstrap method to estimate sampling uncertainty from a given estimator.
bootstrap.estimates( survey.data, survey.design, bootstrap.fn, estimator.fn, num.reps, weights = NULL, ..., summary.fn = NULL, verbose = TRUE, parallel = FALSE, paropts = NULL )
bootstrap.estimates( survey.data, survey.design, bootstrap.fn, estimator.fn, num.reps, weights = NULL, ..., summary.fn = NULL, verbose = TRUE, parallel = FALSE, paropts = NULL )
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey (see Details below) |
bootstrap.fn |
Name of the method to be used to take bootstrap resamples |
estimator.fn |
The name of a function which, given a dataset like
|
num.reps |
The number of bootstrap replication samples to draw |
weights |
Weights to use in estimation (or |
... |
additional arguments which will be passed on to |
summary.fn |
(Optional) Name of a function which, given the set of estimates
produced by |
verbose |
If |
parallel |
If |
paropts |
If not |
The formula describing the survey design should have the form
~ psu_v1 + psu_v2 + ... + strata(strata_v1 + strata_v2 + ...)
,
where psu_v1, ...
are the variables identifying primary sampling units (PSUs)
and strata_v1, ...
identifies the strata
If summary.fn
is not specified, then return the list of estimates
produced by estimator.fn
; if summary.fn
is specified, then return
its output
# example using a simple random sample survey <- MU284.surveys[[1]] estimator <- function(survey.data, weights) { plyr::summarise(survey.data, T82.hat = sum(S82 * weights)) } ex.mu284 <- bootstrap.estimates( survey.design = ~1, num.reps = 10, estimator.fn = estimator, weights='sample_weight', bootstrap.fn = 'srs.bootstrap.sample', survey.data=survey) ## Not run: idu.est <- bootstrap.estimates( ## this describes the sampling design of the ## survey; here, the PSUs are given by the ## variable cluster, and the strata are given ## by the variable region survey.design = ~ cluster + strata(region), ## the number of bootstrap resamples to obtain num.reps=1000, ## this is the name of the function ## we want to use to produce an estimate ## from each bootstrapped dataset estimator.fn="our.estimator", ## these are the sampling weights weights="indweight", ## this is the name of the type of bootstrap ## we wish to use bootstrap.fn="rescaled.bootstrap.sample", ## our dataset survey.data=example.survey, ## other parameters we need to pass ## to the estimator function d.hat.vals=d.hat, total.popn.size=tot.pop.size, y.vals="clients", missing="complete.obs") ## End(Not run)
# example using a simple random sample survey <- MU284.surveys[[1]] estimator <- function(survey.data, weights) { plyr::summarise(survey.data, T82.hat = sum(S82 * weights)) } ex.mu284 <- bootstrap.estimates( survey.design = ~1, num.reps = 10, estimator.fn = estimator, weights='sample_weight', bootstrap.fn = 'srs.bootstrap.sample', survey.data=survey) ## Not run: idu.est <- bootstrap.estimates( ## this describes the sampling design of the ## survey; here, the PSUs are given by the ## variable cluster, and the strata are given ## by the variable region survey.design = ~ cluster + strata(region), ## the number of bootstrap resamples to obtain num.reps=1000, ## this is the name of the function ## we want to use to produce an estimate ## from each bootstrapped dataset estimator.fn="our.estimator", ## these are the sampling weights weights="indweight", ## this is the name of the type of bootstrap ## we wish to use bootstrap.fn="rescaled.bootstrap.sample", ## our dataset survey.data=example.survey, ## other parameters we need to pass ## to the estimator function d.hat.vals=d.hat, total.popn.size=tot.pop.size, y.vals="clients", missing="complete.obs") ## End(Not run)
Take the data for each member of the given chain and assemble it together in a dataset.
chain.data(chain)
chain.data(chain)
chain |
The chain to build a dataset from |
A dataset comprised of all of the chain's members' data put together. The order of the rows in the dataset is not specified.
Count the total number of respondents in the chain and return it
chain.size(chain)
chain.size(chain)
chain |
The chain object |
The number of respondents involved in the chain
Get all of the values of the given variable found among members of a chain
chain.vals(chain, qoi.var = "uid")
chain.vals(chain, qoi.var = "uid")
chain |
The chain to get values from |
qoi.var |
The name of the variable to get from each member of the chain |
A vector with all of the values of qoi.var
found in this chain. (Currently, the order of the values
in the vector is not guaranteed.)
Break down RDS degree distributions by trait, and return an object which has the degrees for each trait as well as functions to draw degrees from each trait.
estimate.degree.distns(survey.data, d.hat.vals, traits, keep.vars = NULL)
estimate.degree.distns(survey.data, d.hat.vals, traits, keep.vars = NULL)
survey.data |
The respondent info |
d.hat.vals |
The variable that contains the degrees for each respondent |
traits |
A vector of the names of the columns
of |
keep.vars |
Additional vars to return along with degrees |
One of the items returned as a result is a function,
draw.degrees.fn
, which takes one argument,
traits
. This is a vector of traits and,
for each entry in this vector, draw.degress.fn
returns a draw from the empirical distribution of
degrees among respondents with that trait. So,
draw.degrees.fn(c("0.0", "0.1", "0.1")
would
return a degree drawn uniformly at random from among
the observed degrees of respondents with trait "0.0"
and then two degrees from respondents with trait "0.1"
An object with
distns
a list with one entry per trait value; each
draw.degrees.fn
a function which gets called with one
keep.vars
the name of the other vars that are kept (if any)
Given a dataset with the respondents and a dataset on the parents (in many cases the same individuals), and a set of relevant traits, estimate mixing parameters and return a markov model.
estimate.mixing(survey.data, parent.data, traits)
estimate.mixing(survey.data, parent.data, traits)
survey.data |
The respondent info |
parent.data |
The parent info |
traits |
The names of the traits to build the model on |
A list with entries:
mixing.df
the data used to estimate the mixing function
choose.next.state.fn
a function which can be passed
a vector of states and will return a draw of a subsequent state
for each entry in the vector
mixing.df
a dataframe (long-form) representation of
the transition counts used to estimate the transition probabilities
states
a list with an entry for each state. within
each state's entry are
trans.probs
a vector of estimated transition probabilities
trans.fn
a function which, when called, randomly chooses a
next state with probabilities given by the transition probs.
This function allows us to determine which ids are directly descended from which other ones. It is the only part of the code that relies on the ID format used by the Curitiba study (see Details); by modifying this function, it should be possible to adapt this code to another study.
is.child.ct(id, seed.id)
is.child.ct(id, seed.id)
id |
the id of the potential child |
seed.id |
the id of the potential parent |
See:
Salganik, M. J., Fazito, D., Bertoni, N., Abdo, A. H., Mello, M. B., & Bastos, F. I. (2011). Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. American journal of epidemiology, 174(10), 1190-1196.
TRUE if id
is the direct descendant of seed.id
and FALSE otherwise
Note that this assumes that the chain is a tree (no loops)
make.chain(seed.id, survey.data, is.child.fn = is.child.ct)
make.chain(seed.id, survey.data, is.child.fn = is.child.ct)
seed.id |
The id of the seed whose chain we wish to build from the dataset |
survey.data |
The dataset |
is.child.fn |
A function which takes two ids as arguments;
it is expected to return |
info
Get the height (maximum depth) of a chain
## S3 method for class 'depth' max(chain)
## S3 method for class 'depth' max(chain)
chain |
The chain object |
The maximum depth of the chain
Run a given markov model for n time steps, starting at a specified state.
mc.sim(mm, start, n)
mc.sim(mm, start, n)
mm |
The markov model object returned by |
start |
The name of the state to start in |
n |
The number of time-steps to run through |
This uses the markov model produced by estimate.mixing()
A vector with the state visited at each time step. the first entry has the starting state
A dataset containing information about Sweden's 284 municipalities. This dataset comes from Model-Assisted Survey Sampling by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204). The columns are:
A data frame with 284 rows and 11 columns:
ID
Population in 1985
Population in 1975
Municipal tax revenue in 1985
Number of Conservative seats in municipal council
Number of Social-Democratic seats in municipal council
Total number of seats in municipal council
Number of municipal employees
Real estate values according to 1984 assessment
Geographic location indicator
Cluster indicator (neighboring municipalities are clustered together)
'Model Assisted Survey Sampling' by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204)
Benchmark results to use in unit tests; these are based on MU284.complex.surveys.
A list with 10 data frames, each with 15 rows and 11 columns:
summaries for each estimand
A list with 10 sample surveys with sample size 15 drawn from the MU284 dataset using a complex sampling design.
A list with 10 data frames, each with 15 rows and 11 columns:
Same as MU284 dataset
The sampling weight for the row
The sampling design comes from Ex. 4.3.2 (pg 142-3) of 'Model Assisted Survey Sampling' by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204).
The design is a two-stage sample:
stage I: the primary sampling units (PSUs) are the standard clusters from
MU284; we take a simple random sample without replacement of n_I = 5
out of N_I = 50
of these
stage II: within each sampled PSU, we take a simple random sample without
replacement of n_i = 3
out of N_i
municipalities
Produce estimates from a simulated sample survey of the MU284 population. Used in package tests and examples.
MU284.estimator.fn(survey.data, weights)
MU284.estimator.fn(survey.data, weights)
survey.data |
the survey dataset |
weights |
a vector with the survey weights |
a data.frame with one row and two columns:
TS82.hat
- the estimated total of S82
R.RMT85.P85.hat
- the estimated ratio of RMT85
/ P85
Summarize results from MU284.estimator.fn()
applied to many surveys.
(This is a dummy function, used for tests)
MU284.estimator.summary.fn(res)
MU284.estimator.summary.fn(res)
res |
a dataframe whose rows are the results of calling |
the same dataframe
A list with 10 sample surveys with sample size 15 drawn from the MU284 dataset using simple random sampling with replacment.
A list with 10 data frames, each with 15 rows and 11 columns:
Same as MU284 dataset
The sampling weight for the row
This function uses the algorithm described in the supporting online material for Weir et al 2012 (see Details) to take bootstrap resamples of one chain from an RDS dataset.
rds.boot.draw.chain(chain, mm, dd, parent.trait, idvar = "uid")
rds.boot.draw.chain(chain, mm, dd, parent.trait, idvar = "uid")
chain |
The chain to draw resamples for |
mm |
The mixing model to use |
dd |
The degree distns to use |
parent.trait |
A vector whose length is the number of bootstrap reps we want |
idvar |
The name of the variable used to label the columns of the output (presumably some id identifying the row in the original dataset they come from) |
See
Weir, Sharon S., et al. "A comparison of respondent-driven and venue-based sampling of female sex workers in Liuzhou, China." Sexually transmitted infections 88.Suppl 2 (2012): i95-i101.
A list of dataframes with one entry for each respondent in the chain. each dataframe has one row for each bootstrap replicate. so if we take 10 bootstrap resamples of a chain of length 50, there will be 50 entries in the list that is returned. each entry will be a dataframe with 10 rows.
Draw bootstrap resamples for an RDS dataset, using
the algorithm described in the supporting online
material of Weir et al 2012 (see rds.boot.draw.chain()
).
rds.chain.boot.draws(chains, mm, dd, num.reps, keep.vars = NULL)
rds.chain.boot.draws(chains, mm, dd, num.reps, keep.vars = NULL)
chains |
A list whose entries are the chains we want to resample |
mm |
The mixing model |
dd |
The degree distributions |
num.reps |
The number of bootstrap resamples we want |
keep.vars |
If not |
A list of length num.reps
; each entry in
the list has one bootstrap-resampled dataset
This algorithm picks a respondent from the survey to be a seed uniformly at random. it then generates a bootstrap draw by simulating the markov process forward for n steps, where n is the size of the draw required.
If you wish the bootstrap dataset to end up with
variables from the original dataset other than the
traits and degree, then you must specify this when
you construct dd
using the
'estimate.degree.distns
function.
rds.mc.boot.draws(chains, mm, dd, num.reps)
rds.mc.boot.draws(chains, mm, dd, num.reps)
chains |
A list with the chains constructed from the survey
using |
mm |
The mixing model |
dd |
The degree distributions |
num.reps |
The number of bootstrap resamples we want |
See:
Salganik, Matthew J. "Variance estimation, design effects, and sample size calculations for respondent-driven sampling." Journal of Urban Health 83.1 (2006): 98-112.
A list of length num.reps
; each entry in
the list has one bootstrap-resampled dataset
Given a survey dataset and a description of the survey design (ie, which combination of variables determines primary sampling units, and which combination of variables determines strata), take a bunch of bootstrap samples for the rescaled bootstrap estimator (see Details).
rescaled.bootstrap.sample( survey.data, survey.design, parallel = FALSE, paropts = NULL, num.reps = 1 )
rescaled.bootstrap.sample( survey.data, survey.design, parallel = FALSE, paropts = NULL, num.reps = 1 )
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey (see Details) |
parallel |
If |
paropts |
An optional list of arguments passed on to |
num.reps |
The number of bootstrap replication samples to draw |
survey.design
is a formula of the form
weight ~ psu_vars + strata(strata_vars)
where:
weight
is the variable with the survey weights
psu_vars
has the form psu_v1 + psu_v2 + ...
, where primary
sampling units (PSUs) are determined by psu_v1
, etc
strata_vars
has the form strata_v1 + strata_v2 + ...
, which
determine strata
Note that we assume that the formula uniquely specifies PSUs. This will always be true if the PSUs were selected without replacement. If they were selected with replacement, then it will be necessary to make each realization of a given PSU in the sample a unique id. The code below assumes that all observations within each PSU (as identified by the design formula) are from the same draw of the PSU.
The rescaled bootstrap technique works by adjusting the estimation weights based on the number of times each row is included in the resamples. If a row is never selected, it is still included in the returned results, but its weight will be set to 0. It is therefore important to use estimators that make use of the estimation weights on the resampled datasets.
We always take m_i = n_i - 1, according to the advice presented in Rao and Wu (1988) and Rust and Rao (1996).
(This is a C++ version; a previous version, written in pure R,
is called rescaled.bootstrap.sample.pureR()
)
References:
Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.
Rao, Jon NK, and C. F. J. Wu. "Resampling inference with complex survey data." Journal of the American Statistical Association 83.401 (1988): 231-241.
A list with num.reps
entries. Each entry is a dataset which
has at least the variables index
(the row index of the original
dataset that was resampled) and weight.scale
(the factor by which to multiply the sampling weights
in the original dataset).
survey <- MU284.complex.surveys[[1]] boot_surveys <- rescaled.bootstrap.sample(survey.data = survey, survey.design = ~ CL, num.reps = 2)
survey <- MU284.complex.surveys[[1]] boot_surveys <- rescaled.bootstrap.sample(survey.data = survey, survey.design = ~ CL, num.reps = 2)
(this is the pure R version; it has been supplanted by
rescaled.bootstrap.sample
, which is partially written in C++)
rescaled.bootstrap.sample.pureR( survey.data, survey.design, parallel = FALSE, paropts = NULL, num.reps = 1 )
rescaled.bootstrap.sample.pureR( survey.data, survey.design, parallel = FALSE, paropts = NULL, num.reps = 1 )
survey.data |
the dataset to use |
survey.design |
a formula describing the design of the survey (see below - TODO) |
parallel |
if TRUE, use parallelization (via |
paropts |
an optional list of arguments passed on to |
num.reps |
the number of bootstrap replication samples to draw |
given a survey dataset and a description of the survey design (ie, which combination of vars determines primary sampling units, and which combination of vars determines strata), take a bunch of bootstrap samples for the rescaled bootstrap estimator (see, eg, Rust and Rao 1996).
Note that we assume that the formula uniquely specifies PSUs. This will always be true if the PSUs were selected without replacement. If they were selected with replacement, then it will be necessary to make each realization of a given PSU in the sample a unique id. Bottom line: the code below assumes that all observations within each PSU (as identified by the design formula) are from the same draw of the PSU.
The rescaled bootstrap technique works by adjusting the estimation weights based on the number of times each row is included in the resamples. If a row is never selected, it is still included in the returned results, but its weight will be set to 0. It is therefore important to use estimators that make use of the estimation weights on the resampled datasets.
We always take m_i = n_i - 1, according to the advice presented in Rao and Wu (1988) and Rust and Rao (1996).
survey.design
is a formula of the form
weight ~ psu_vars + strata(strata_vars),
where weight is the variable with the survey weights and psu
is the variable denoting the primary sampling unit
a list with num.reps
entries. each entry is a dataset which
has at least the variables index
(the row index of the original
dataset that was resampled) and weight.scale
(the factor by which to multiply the sampling weights
in the original dataset).
This function creates a dataset with rescaled bootstrap weights;
it can be a helpful alternative to bootstrap.estimates
in some situations
rescaled.bootstrap.weights( survey.data, survey.design, num.reps, weights = NULL, idvar, verbose = TRUE, parallel = FALSE, paropts = NULL )
rescaled.bootstrap.weights( survey.data, survey.design, num.reps, weights = NULL, idvar, verbose = TRUE, parallel = FALSE, paropts = NULL )
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey
(see Details in |
num.reps |
the number of bootstrap replication samples to draw |
weights |
weights to use in estimation (or NULL, if none) |
idvar |
the name of the column in |
verbose |
if TRUE, produce lots of feedback about what is going on |
parallel |
if TRUE, use the plyr library's .parallel argument to produce bootstrap resamples and estimates in parallel |
paropts |
if not NULL, additional arguments to pass along to the parallelization routine |
The formula describing the survey design should have the form
~ psu_v1 + psu_v2 + ... + strata(strata_v1 + strata_v2 + ...)
,
where psu_v1, ...
are the variables identifying primary sampling units (PSUs)
and strata_v1, ...
identify the strata
if no summary.fn is specified, then return the list of estimates produced by estimator.fn; if summary.fn is specified, then return its output
survey <- MU284.complex.surveys[[1]] rescaled.bootstrap.weights(survey.data = survey, survey.design = ~ CL, weights='sample_weight', idvar='LABEL', num.reps = 2) ## Not run: bootweights <- rescaled.bootstrap.weights( # formula describing survey design: # psu and strata survey.design = ~ psu + stratum(stratum_analysis), num.reps=10000, # column with respondent ids idvar='caseid', # column with sampling weight weights='wwgt', # survey dataset survey.data=mw.ego) ## End(Not run)
survey <- MU284.complex.surveys[[1]] rescaled.bootstrap.weights(survey.data = survey, survey.design = ~ CL, weights='sample_weight', idvar='LABEL', num.reps = 2) ## Not run: bootweights <- rescaled.bootstrap.weights( # formula describing survey design: # psu and strata survey.design = ~ psu + stratum(stratum_analysis), num.reps=10000, # column with respondent ids idvar='caseid', # column with sampling weight weights='wwgt', # survey dataset survey.data=mw.ego) ## End(Not run)
Given a survey dataset and a description of the survey design (ie, which combination of vars determines primary sampling units, and which combination of vars determines strata), take a bunch of bootstrap samples under a simple random sampling (with repetition) scheme
srs.bootstrap.sample( survey.data, num.reps = 1, parallel = FALSE, paropts = NULL, ... )
srs.bootstrap.sample( survey.data, num.reps = 1, parallel = FALSE, paropts = NULL, ... )
survey.data |
The dataset to use |
num.reps |
The number of bootstrap replication samples to draw |
parallel |
If |
paropts |
An optional list of arguments passed on to |
... |
Ignored, but useful because it allows params like |
A list with num.reps
entries. Each entry is a dataset which has
at least the variables index
(the row index of the original dataset that
was resampled) and weight.scale
(the factor by which to multiply the
sampling weights in the original dataset).
survey <- MU284.surveys[[1]] boot_surveys <- srs.bootstrap.sample(survey, num.reps = 2)
survey <- MU284.surveys[[1]] boot_surveys <- srs.bootstrap.sample(survey, num.reps = 2)
surveybootstrap
has methods for analyzing data that were collected
using network reporting techniques. It includes estimators appropriate for
the simple bootstrap and the rescaled bootstrap.