
05: Gathering and Loading Samples
RSTr-samples.Rmd
Overview
After initializing the model with initialize_model(), you can gather and load samples using the run_sampler() and load_samples() functions. Finally, using some basic plotting, we can look at our results and make sure they make sense.
The run_sampler() function
run_sampler() takes in four arguments: name, dir, .show_plots, and .discard_burnin. name and dir are associated with the name of the folder that contains the model content and the directory where that folder lives, respectively. .show_plots allows you to hide the traceplots generated while the model runs, and .discard_burnin prevents samples from the 2000-iteration burn-in period from being saved, which is useful if the dataset is particularly large. The information needed to generate the model is already saved in name, so run_sampler() pulls this information in to generate parameter samples. While run_sampler() runs, the R console will print the batch number, the total iteration number, and the time the last batch started. By default, run_sampler() will generate 6000 samples, but other values can be specified as desired, rounded down to the nearest 100. For example, we can begin a simple model and generate samples:
library(RSTr)
initialize_model(name = "my_test_model", data = miheart, adjacency = miadj)
run_sampler("my_test_model")
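For larger datasets, you may want to suppress the live traceplots or skip saving the burn-in samples. The sketch below uses the arguments described above; it assumes both are logical flags, so check the function documentation for the exact defaults:

run_sampler(
  name = "my_test_model",
  .show_plots = FALSE,    # assumption: FALSE hides the live traceplots
  .discard_burnin = TRUE  # assumption: TRUE skips saving the 2000-iteration burn-in
)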
The load_samples() function
Once run_sampler() tells you that the model is finished running, you can import samples into R using load_samples(). load_samples() takes in four arguments:

- name: Name of the model;
- dir: Directory of the model. Saves to R's temporary directory by default;
- param: The parameter to import samples for. By default, imports theta. Values for theta and beta are expit- or exp-transformed, depending on the method chosen in initialize_model(); and
- burn: Specifies a burn-in period for samples. This allows the model time to stabilize before using samples to generate estimates. By default, has a burn-in period of 2000 samples. Because of the way that load_samples() utilizes burn, it is safest to choose a burn-in period that is a multiple of 500.
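As a quick illustration of these arguments, we could import the beta samples with a longer burn-in. This is only a sketch; the burn value here is illustrative:

beta <- load_samples(
  name = "my_test_model",
  param = "beta",  # import beta samples instead of the default theta
  burn = 2500      # burn-in chosen as a multiple of 500
)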
Any dimnames that were saved to data will be applied to the samples as appropriate. Here, we pull in the theta samples for our test Michigan dataset with a 2000-sample burn-in period (as specified by the default arguments). We also multiply by 100,000, as it is common to display mortality rates per 100,000 individuals:
theta <- load_samples("my_test_model") * 1e5
Age-standardization
But what if we want to age-standardize our samples? In our Michigan dataset, we have six 10-year age groups starting at age 35 years. Perhaps we can age-standardize these into a new 35-64 group. Since we are using data from 1979-1988, we can use 1980 standard populations from NIH to generate our std_pop vector:
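The chunk that builds std_pop is not reproduced here. A minimal sketch would be a vector of 1980 standard population counts for the three age groups we plan to combine; the values below are placeholders only, so substitute the actual 1980 standard population figures:

# Placeholder standard population counts for the groups we will combine;
# replace these with the published 1980 standard population values.
std_pop <- c("35-44" = 1e5, "45-54" = 1e5, "55-64" = 1e5)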
With std_pop generated, we then need to check which margin contains our age group information using the dim() function:
dim(theta)
#> [1] 83 6 10 400
Our theta array has four margins with dimensions 83, 6, 10, and 400, representing the spatial regions, age groups, time periods, and samples, respectively. Let's set a variable margin_age to represent our age group margin and standardize our theta estimates using the age_standardize() function:
margin_age <- 2
theta_3564 <- age_standardize(theta[, 1:3, , ], std_pop, margin_age)
We subset theta to only the columns containing our age groups of interest: 35-44, 45-54, and 55-64 (i.e., columns 1-3).
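Because the dimnames saved to data carry over to the samples, the same subset can also be written by name, which is less error-prone if the column order ever changes. A sketch assuming the age-group labels listed above:

# Equivalent subset using the age-group dimnames instead of column positions
theta_3564 <- age_standardize(theta[, c("35-44", "45-54", "55-64"), , ], std_pop, margin_age)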
Now, we have a standalone array for our age-standardized 35-64 age group. Let's consolidate this into our main theta array using the bind_samples() function:
theta <- bind_samples(
sample = theta,
agg_sample = theta_3564,
margin = margin_age,
new_name = "35-64"
)
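If bind_samples() appends the standardized group as a new slice along the age margin, which is what the consolidation described here suggests, a quick look at the dimnames should now include the new group:

dimnames(theta)[[margin_age]]  # the new "35-64" group should appear alongside the original age groups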
Now, our original samples and our age-standardized samples live in the same array. If we want to do aggregation without standardization, such as weighted averages among non-age groups, it is important to also pull in the population array as the weight and to calculate population information along any consolidated or standardized groups. If the population data isn't readily available, it can be pulled from the model folder using the load_pop() function, then prepared for use with theta using aggregate_pop() and bind_samples():
pop <- load_pop("my_test_model")
pop <- bind_samples(
sample = pop,
agg_sample = aggregate_pop(pop[, 1:3, ], margin_age),
margin = margin_age,
new_name = "35-64"
)
So now, what if we want to get estimates averaged across all years? In this case, we will expand our theta and pop arrays using the aggregate_group() and aggregate_pop() functions, respectively:
margin_year <- 3
theta <- bind_samples(
theta,
aggregate_group(theta, pop, margin_year),
margin_year,
"1979-1988"
)
pop <- bind_samples(
pop,
aggregate_pop(pop, margin_year),
margin_year,
"1979-1988"
)
Now, our samples for theta are aggregated by year and age-standardized, and we have matching array values for pop.
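As a quick sanity check (assuming each bind_samples() call appends a single new slice), the first three margins of theta should now line up with the margins of pop:

dim(theta)  # regions x age groups (now including "35-64") x years (now including "1979-1988") x samples
dim(pop)    # should match the first three margins of theta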
Note that if you plan on doing a mix of non-age aggregation and age-standardization, do age-standardization after all aggregation, as doing age-standardization first will alter the results of any aggregation done afterward.
Note that the process of age- and group-standardization is identical for both MSTCAR and MCAR models, and that age/group-standardization is not possible for UCAR models.
Traceplots
Before we generate our median estimates for our samples, let’s quickly check some high- and low-population counties to see the stability of the samples:
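# Identify the highest- and lowest-population counties for the 55-64 group in 1979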
counties <- c(
which.max(miheart$n[, "55-64", "1979"]),
which.min(miheart$n[, "55-64", "1979"])
)
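# Sample labels for the x-axis, the 35-64 samples for 1979 in each county,
# and the matching crude rates per 100,000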
x <- dimnames(theta)[[4]]
yhigh <- theta[counties[1], "35-64", "1979", ]
ylow <- theta[counties[2], "35-64", "1979", ]
raw_high <- sum(miheart$Y[counties[1], 1:3, "1979"]) /
sum(miheart$n[counties[1], 1:3, "1979"]) *
1e5
raw_low <- sum(miheart$Y[counties[2], 1:3, "1979"]) /
sum(miheart$n[counties[2], 1:3, "1979"]) *
1e5
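# Draw the two traceplots side by side, with each county's crude rate as a red reference line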
par(mfrow = c(1, 2))
plot(
x,
yhigh,
type = "l",
main = paste0("Smoothed rates, 1979, 35-64, FIPS ", names(counties[1]))
)
abline(h = raw_high, col = "red")
plot(
x,
ylow,
type = "l",
main = paste0("Smoothed rates, 1979, 35-64, FIPS ", names(counties[2]))
)
abline(h = raw_low, col = "red")
This code generates two traceplots showing our sample values over the course of the run for a given county-group-year. On the left is a traceplot for the highest-population county, and on the right is a traceplot for the lowest-population county. Each traceplot also includes a red line marking the crude event rate for that county-group-year. Note that in both traceplots the red line sits near the samples but above them, indicating that the rates in both of these counties were attenuated by spatial smoothing.
The traceplot on the left is exactly what we want to see: it clearly fluctuates around a certain value and will give us reliable estimates for that county. The right traceplot, however, is a bit less favorable: the value doesn't seem to stabilize and jumps between values over the course of the model run. However, the rate itself is naturally high, so the intensity of the fluctuation isn't shocking. Smaller counties like the one shown here demonstrate the limits of RSTr: while the samples do hover around a single value for some iterations, the estimated values we would get for the county-group-year on the right will not be as reliable as the estimates on the left due to the large variability of the samples gathered. To learn more about reliability measures, read vignette("RSTr-medians").
Closing Thoughts
In this vignette, we discussed generating samples with run_sampler(), importing those samples into R with load_samples() and age-standardizing them, and generating traceplots to get a gut check on our dataset and on the implications of sporadic traceplots.