
02: Understanding and Preparing Your Event Data
RSTr-event.Rmd
Overview
The event and population data are at the core of the RSTr model. They work alongside the adjacency information to generate smoothed estimates. In this vignette, we’ll discuss requirements for event and population data and walk through an example with a data.frame.
Requirements
Data must be a list object with names Y and n for the event counts and for the population counts, respectively;
Y and n are intended to be entire-population data. While it is possible to use RSTr to analyze survey data or datasets that don’t include all members of a population of interest, RSTr does not currently allow for the inclusion of survey weights and thus assumes that each Y / n is an unbiased estimate of the underlying event rate;
Y and n must contain real numbers. Negative and infinite counts are not allowed, but suppressed data containing NA values is acceptable for the Y values. Note, however, that n must have all population counts;
For the MSTCAR model, Y and n must be three-dimensional arrays: the first margin (rows) specifies the region, the second margin (columns) specifies the groups of interest, and the third margin (matrix slices) specifies the time period. Other models will follow this same order of margins: for example, data for the MCAR model will be a two-dimensional array (matrix) with regions along the rows and groups along the columns. Data for the UCAR model can simply be a vector;
Time periods, regions, and groups must be consistent. If your data contains counts for all regions in a specified set of groups for 1979 and 1981, for example, it must also include counts for all regions and all groups for 1980 as well, even if those counts have zero events;
Groups of many types are allowed as long as your sociodemographic groups are combined in the appropriate margin. For example, your groups may include just age groups, a mixture of age-sex groups, or even a mix of age-race-sex groups;
Finally, Y and n can optionally have dimension names associated with them. This makes for easy identification of counties, groups, and time periods, and is necessary should you want to age-standardize data using RSTr’s additional functionality.
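The requirements above can be checked before fitting a model. The following is a minimal sketch, not part of RSTr: check_event_data() is a hypothetical helper, the toy matrices are made up, and the check that Y and n share the same shape is an assumption implied by, but not stated in, the list above.

```r
# Hypothetical helper (not an RSTr function): checks a data list against
# the requirements listed above and errors on the first violation.
check_event_data <- function(data) {
  stopifnot(
    is.list(data),
    all(c("Y", "n") %in% names(data))   # elements must be named Y and n
  )
  Y <- data$Y
  n <- data$n
  stopifnot(
    is.numeric(Y), is.numeric(n),
    identical(dim(Y), dim(n)),          # assumed: Y and n share the same shape
    length(Y) == length(n),
    !anyNA(n),                          # population counts must be complete
    all(is.finite(n)), all(n >= 0),
    all(is.finite(Y[!is.na(Y)])),       # NA is allowed in Y (suppressed counts)
    all(Y[!is.na(Y)] >= 0)
  )
  invisible(TRUE)
}

# Toy MCAR-style data: 3 regions (rows) by 2 groups (columns)
Y <- matrix(c(15, 11, NA, 57, 63, 191), nrow = 3,
            dimnames = list(region = c("A", "B", "C"), group = c("F", "M")))
n <- matrix(c(25239, 24884, 80171, 21261, 22465, 71943), nrow = 3,
            dimnames = dimnames(Y))
ok <- check_event_data(list(Y = Y, n = n))
```

A suppressed (NA) event count passes, while a missing or negative population count would stop with an error.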
Example: CDC WONDER dataset
To walk through the data setup from a data.frame to the final array list, we will use data generated by CDC WONDER’s Underlying Cause of Death Compressed Mortality, ICD-9 database, found at https://wonder.cdc.gov/cmf-icd9.html:
library(RSTr)
head(maexample)
#> Notes Year Year.Code County County.Code Sex Sex.Code Deaths
#> 1 1979 1979 Barnstable County, MA 25001 Female F 15
#> 2 1979 1979 Barnstable County, MA 25001 Male M 57
#> 3 1979 1979 Berkshire County, MA 25003 Female F 11
#> 4 1979 1979 Berkshire County, MA 25003 Male M 63
#> 5 1979 1979 Bristol County, MA 25005 Female F 52
#> 6 1979 1979 Bristol County, MA 25005 Male M 191
#> Population Crude.Rate
#> 1 25239 59.4 (Unreliable)
#> 2 21261 268.1
#> 3 24884 44.2 (Unreliable)
#> 4 22465 280.4
#> 5 80171 64.9
#> 6 71943 265.5
Our example dataset contains acute myocardial infarction (ICD-9: 410) mortality and population data in all counties of Massachusetts for men and women aged 35-64 from 1979 to 1981. This dataset also includes some notes in the bottom rows describing the dataset. maexample contains several variables:
Notes: Provides general information about the dataset, starting at row 85;
Year and Year.Code specify the year;
County and County.Code specify the county name and associated FIPS code;
Sex and Sex.Code specify the sex group;
Deaths contains our mortality counts of interest;
Population contains our population counts of interest;
Crude.Rate shows the rates per 100,000 in each year-county-sex group. We will not use this column.
The first thing we want to do with our dataset is remove the notes from the bottom rows. While they are useful for getting acquainted with the dataset, they would ultimately mess up our event and population arrays. Since Year does not have information in rows with notes, we can use that to filter our data:
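A minimal sketch of this filter, mirroring the subsetting used later in this vignette for the MCAR example. The toy data.frame is a made-up stand-in for maexample so that the snippet runs on its own; the commented line shows the same pattern applied to the vignette data, assuming the result is stored as ma_mort.

```r
# Toy stand-in for maexample: two data rows followed by a notes row,
# where Year is NA
toy <- data.frame(
  Notes = c("", "", "some note text"),
  Year = c(1979, 1979, NA),
  Deaths = c(15, 57, NA)
)

# Keep only the rows where Year is present
toy_mort <- toy[which(!is.na(toy$Year)), ]

# Applied to the vignette data (requires RSTr's maexample):
# ma_mort <- maexample[which(!is.na(maexample$Year)), ]
```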
The above code searches for values in maexample$Year that aren’t NA and creates a new dataset containing only those rows. Before we start generating our arrays, let’s take stock of how our data is laid out:
head(ma_mort)
#> Notes Year Year.Code County County.Code Sex Sex.Code Deaths
#> 1 1979 1979 Barnstable County, MA 25001 Female F 15
#> 2 1979 1979 Barnstable County, MA 25001 Male M 57
#> 3 1979 1979 Berkshire County, MA 25003 Female F 11
#> 4 1979 1979 Berkshire County, MA 25003 Male M 63
#> 5 1979 1979 Bristol County, MA 25005 Female F 52
#> 6 1979 1979 Bristol County, MA 25005 Male M 191
#> Population Crude.Rate
#> 1 25239 59.4 (Unreliable)
#> 2 21261 268.1
#> 3 24884 44.2 (Unreliable)
#> 4 22465 280.4
#> 5 80171 64.9
#> 6 71943 265.5
We can use the xtabs() function to transform our data.frame into mortality and population arrays with properly oriented margins:
Y <- xtabs(Deaths ~ County.Code + Sex.Code + Year.Code, data = ma_mort)
n <- xtabs(Population ~ County.Code + Sex.Code + Year.Code, data = ma_mort)
When preparing data for the MSTCAR model, make sure that the variables in your xtabs() expression follow the order listed above: geographic regions, sociodemographic groups, and years. If you have multiple types of groups, such as race and sex, it can take a little finessing to set up your group data, such as creating a combined race-sex group variable, but data setup will follow the same principles as above.
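As a rough illustration of the combined-group approach, the sketch below builds a single race-sex variable with interaction() before calling xtabs(). The data.frame, its column names, and its counts are made up for this example and do not come from maexample.

```r
# Toy long-format data: 2 regions x 2 races x 2 sexes x 2 years
toy <- expand.grid(
  region = c("A", "B"),
  race = c("Black", "White"),
  sex = c("F", "M"),
  year = c(1979, 1980),
  stringsAsFactors = FALSE
)
toy$deaths <- seq_len(nrow(toy))  # made-up counts

# Combine race and sex into a single group variable
toy$race_sex <- interaction(toy$race, toy$sex, sep = "-")

# Regions, then groups, then years, matching the MSTCAR margin order
Y_toy <- xtabs(deaths ~ region + race_sex + year, data = toy)
dim(Y_toy)
```

The second margin of Y_toy then holds one column per race-sex combination, so the array keeps the region, group, year layout.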
Now that our arrays are set up, organized, and properly named, we can finally consolidate them into a list to be used with the model:
data <- list(Y = Y, n = n)
Note that you must specify the names of each array element as above, as creating a list with just the objects will not name each element, and the names Y and n are necessary for RSTr to know how to use the data.
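The difference between the two ways of building the list can be seen with a small base R example; the vectors here are toy values used only for illustration.

```r
Y <- c(a = 1, b = 2)
n <- c(a = 100, b = 200)

unnamed <- list(Y, n)
names(unnamed)            # NULL: elements can only be reached by position

named <- list(Y = Y, n = n)
names(named)              # "Y" "n"
```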
Data setup for other models
The above dataset is prepared specifically for an MSTCAR model. But what if we only want to run an MCAR or even a UCAR model? We can filter the original dataset and follow a similar procedure to prepare our data for the MCAR model:
ma_mort_mcar <- maexample[which(!is.na(maexample$Year)), ]
ma_mort_mcar <- ma_mort_mcar[ma_mort_mcar$Year == 1979, ] # filter dataset to only show 1979 data
Y <- xtabs(Deaths ~ County.Code + Sex.Code, data = ma_mort_mcar)
n <- xtabs(Population ~ County.Code + Sex.Code, data = ma_mort_mcar)
data <- list(Y = Y, n = n)
head(data$Y)
#> Sex.Code
#> County.Code F M
#> 25001 15 57
#> 25003 11 63
#> 25005 52 191
#> 25007 4 1
#> 25009 70 239
#> 25011 6 27
head(data$n)
#> Sex.Code
#> County.Code F M
#> 25001 25239 21261
#> 25003 24884 22465
#> 25005 80171 71943
#> 25007 1498 1361
#> 25009 108762 98222
#> 25011 10188 9669
Note that xtabs() works by aggregating data along the specified variables in the expression argument. In the case of the MCAR model, we filter down to the year we want because otherwise, it would give us the mortality and population counts summed over all years in our dataset instead of just for 1979.
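A small sketch of this aggregation behavior, using a made-up data.frame rather than maexample:

```r
# Toy data: one region-group pair observed in three years
toy <- data.frame(
  region = "A",
  sex = "F",
  year = c(1979, 1980, 1981),
  deaths = c(15, 20, 25)
)

# Without filtering, year is collapsed and the counts are summed
all_years <- xtabs(deaths ~ region + sex, data = toy)
all_years["A", "F"]   # 60

# Filtering first keeps only the 1979 count
only_1979 <- xtabs(deaths ~ region + sex, data = toy[toy$year == 1979, ])
only_1979["A", "F"]   # 15
```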
For the UCAR model, setup is similar:
ma_mort_ucar <- maexample[which(!is.na(maexample$Year)), ]
ma_mort_ucar <- ma_mort_ucar[
  ma_mort_ucar$Year == 1979 & ma_mort_ucar$Sex == "Male", # filter dataset to only show 1979 data for men
]
Y <- xtabs(Deaths ~ County.Code, data = ma_mort_ucar)
n <- xtabs(Population ~ County.Code, data = ma_mort_ucar)
data <- list(Y = Y, n = n)
head(data$Y)
#> County.Code
#> 25001 25003 25005 25007 25009 25011
#> 57 63 191 1 239 27
head(data$n)
#> County.Code
#> 25001 25003 25005 25007 25009 25011
#> 21261 22465 71943 1361 98222 9669
Closing Thoughts
In this vignette, we took data generated from CDC WONDER, removed the unnecessary note rows by subsetting on non-missing Year values, and constructed our event and population arrays using xtabs(). Setting up the data for RSTr can seem daunting at first, but with a few quick tricks in R, it can be easy to have your data organized for analysis.