Skip to contents

Overview

The event and population data are at the core of the RSTr model. They work alongside the adjacency information to generate smoothed estimates. In this vignette, we’ll discuss requirements for event and population data and walk through an example with a data.frame.

Requirements

  • Data must be a list object with names Y and n for the event counts and for the population counts, respectively;

  • Y and n are intended to be entire-population data. While it is possible to use RSTr to analyze survey data or datasets that don’t include all members of a population of interest, RSTr does not currently allow for the inclusion of survey weights and thus assumes that each Y / n is an unbiased estimate of the underlying event rate;

  • Y and n must contain real numbers. Negative and infinite counts are not allowed, but suppressed data containing NA’s is acceptable for the Y values. Note, however, that n must have all population counts;

  • For the MSTCAR model, Y and n must be a three-dimensional array: the first margin (rows) specifies the region, the second margin (columns) specifies the groups of interest, and the third margin (matrix slices) specifies the time period. Other models will follow this same order of margins: for example, data for the MCAR model will be a two-dimensional array (matrix) with regions along the rows and groups along the columns. Data for the UCAR model can simply be a vector;

  • Time periods, regions, and groups must be consistent. If your data contains counts for all regions in a specified set of groups for 1979 and 1981, for example, it must also include counts for all regions and all groups for 1980 as well, even if those counts have zero events;

  • Groups of many types are allowed as long as your sociodemographic groups are combined in the appropriate margin. For example, your groups may include just age groups, a mixture of age-sex groups, or even a mix of age-race-sex groups;

  • Finally, Y and n can optionally have dimension names associated with them. This makes for easy identification of counties, groups, and time periods, and is necessary should you want to age-standardize data using RSTr’s additional functionality.

Example: CDC WONDER dataset

To walk through the data setup from a data.frame to the final array list, we will use data generated by CDC WONDER’s Underlying Cause of Death Compressed Mortality, ICD-9 database, found at https://wonder.cdc.gov/cmf-icd9.html:

library(RSTr)
head(maexample)
#>   Notes Year Year.Code                County County.Code    Sex Sex.Code Deaths
#> 1       1979      1979 Barnstable County, MA       25001 Female        F     15
#> 2       1979      1979 Barnstable County, MA       25001   Male        M     57
#> 3       1979      1979  Berkshire County, MA       25003 Female        F     11
#> 4       1979      1979  Berkshire County, MA       25003   Male        M     63
#> 5       1979      1979    Bristol County, MA       25005 Female        F     52
#> 6       1979      1979    Bristol County, MA       25005   Male        M    191
#>   Population        Crude.Rate
#> 1      25239 59.4 (Unreliable)
#> 2      21261             268.1
#> 3      24884 44.2 (Unreliable)
#> 4      22465             280.4
#> 5      80171              64.9
#> 6      71943             265.5

Our example dataset contains acute myocardial infarction (ICD-9: 410) mortality and population data in all counties of Massachusetts for men and women aged 35-64 from 1979 to 1981. This dataset also includes some notes in the bottom rows describing the dataset. maexample contains several variables:

  • Notes: Provides general information about the dataset, starting at row 85;

  • Year and Year.Code specify the year;

  • County and County.Code specify the county name and associated FIPS code;

  • Sex and Sex.Code specify the sex group;

  • Deaths contains our mortality counts of interest;

  • Population contains our population counts of interest;

  • Crude.Rate shows the rates per 100,000 in each year-county-sex group. This column will not be used by us.

The first thing we want to do with our dataset is remove the notes from the bottom rows - while they are useful for getting acquainted with the dataset, they will ultimately mess up our population arrays. Since Year does not have information in rows with notes, we can use that to filter our data:

ma_mort <- maexample[which(!is.na(maexample$Year)), ]

The above code searches for values in maexample$Year that aren’t NA and creates a new dataset containing only those rows. Before we start generating our arrays, let’s take stock of how our data is listed out:

head(ma_mort)
#>   Notes Year Year.Code                County County.Code    Sex Sex.Code Deaths
#> 1       1979      1979 Barnstable County, MA       25001 Female        F     15
#> 2       1979      1979 Barnstable County, MA       25001   Male        M     57
#> 3       1979      1979  Berkshire County, MA       25003 Female        F     11
#> 4       1979      1979  Berkshire County, MA       25003   Male        M     63
#> 5       1979      1979    Bristol County, MA       25005 Female        F     52
#> 6       1979      1979    Bristol County, MA       25005   Male        M    191
#>   Population        Crude.Rate
#> 1      25239 59.4 (Unreliable)
#> 2      21261             268.1
#> 3      24884 44.2 (Unreliable)
#> 4      22465             280.4
#> 5      80171              64.9
#> 6      71943             265.5

It’s important to take note of how the data cycles through each year-county-sex group so we can appropriately construct our array. In ma_mort, the data cycles through each sex group, then each county, and finally each year. This means that when we initially construct our arrays, the dimensions will be num_group x num_county x num_year. We know that in our dataset, there are 2 sex groups, 14 counties, and 3 years, so we can use that information to construct Y and n from ma_mort$Deaths and ma_mort$Population:

Y <- array(ma_mort$Deaths, dim = c(2, 14, 3))
n <- array(ma_mort$Population, dim = c(2, 14, 3))

Now we have our two arrays set up. However, we have two more things to address before our data is ready to be put into a list and into our model: the order of the margins and the dimension names. To fix the order of the margins, we can use the aperm() function, which helps to reorder the margins of a dataset as desired:

Y <- aperm(Y, perm = c(2, 1, 3))
n <- aperm(n, perm = c(2, 1, 3))

Now, we have our regions on the rows, our sex groups on the columns, and our time periods on the matrix slices. Finally, we can use a list in conjunction with the unique() function to specify the dimension names for both of these arrays:

dimnames(Y) <- dimnames(n) <- list(
  county = unique(ma_mort$County.Code),
  group = unique(ma_mort$Sex.Code),
  year = unique(ma_mort$Year.Code)
)
head(Y)
#> , , year = 1979
#> 
#>        group
#> county   F   M
#>   25001 15  57
#>   25003 11  63
#>   25005 52 191
#>   25007  4   1
#>   25009 70 239
#>   25011  6  27
#> 
#> , , year = 1980
#> 
#>        group
#> county   F   M
#>   25001 13  46
#>   25003 15  63
#>   25005 59 211
#>   25007  0   4
#>   25009 68 241
#>   25011 10  21
#> 
#> , , year = 1981
#> 
#>        group
#> county   F   M
#>   25001 15  52
#>   25003 13  42
#>   25005 49 201
#>   25007  1   5
#>   25009 74 225
#>   25011  5  21
head(n)
#> , , year = 1979
#> 
#>        group
#> county       F     M
#>   25001  25239 21261
#>   25003  24884 22465
#>   25005  80171 71943
#>   25007   1498  1361
#>   25009 108762 98222
#>   25011  10188  9669
#> 
#> , , year = 1980
#> 
#>        group
#> county       F     M
#>   25001  25578 21552
#>   25003  24842 22408
#>   25005  80195 71977
#>   25007   1432  1310
#>   25009 108882 98258
#>   25011  10137  9635
#> 
#> , , year = 1981
#> 
#>        group
#> county       F     M
#>   25001  26365 22381
#>   25003  24631 22285
#>   25005  80023 72116
#>   25007   1509  1392
#>   25009 109453 99112
#>   25011  10117  9643

If you have multiple types of groups, such as race and sex, it can take a little finessing to set up your group data, such as creating a combined race-sex group, but data setup will follow the same principles as above.

Now that our arrays are set up, organized, and properly named, we can finally consolidate them into a list to be used with the model:

data <- list(Y = Y, n = n)

Note that you must specify the names of each array element as above, as creating a list with just the objects will not name each element, and the names Y and n are necessary for RSTr to know how to use the data.

Data setup for other models

The above dataset is prepared specifically for an MSTCAR model. But what if we only want to run an MCAR or even a UCAR model? We can filter the original dataset and follow a similar procedure to prepare our data for the MCAR model:

ma_mort_mcar <- maexample[which(!is.na(maexample$Year)), ]
ma_mort_mcar <- ma_mort_mcar[ma_mort_mcar$Year == 1979, ] # filter dataset to only show 1979 data
Y <- matrix(ma_mort_mcar$Deaths, nrow = 2, ncol = 14) # data goes by sex group first, then county
n <- matrix(ma_mort_mcar$Population, nrow = 2, ncol = 14)
Y <- t(Y) # transpose so that regions are on rows, groups are on columns
n <- t(n)
dimnames(Y) <- dimnames(n) <- list(
  county = unique(ma_mort_mcar$County.Code),
  group = unique(ma_mort_mcar$Sex.Code)
)
data <- list(Y = Y, n = n)
head(data$Y)
#>        group
#> county   F   M
#>   25001 15  57
#>   25003 11  63
#>   25005 52 191
#>   25007  4   1
#>   25009 70 239
#>   25011  6  27
head(data$n)
#>        group
#> county       F     M
#>   25001  25239 21261
#>   25003  24884 22465
#>   25005  80171 71943
#>   25007   1498  1361
#>   25009 108762 98222
#>   25011  10188  9669

For the UCAR model, setup is even easier:

ma_mort_ucar <- maexample[which(!is.na(maexample$Year)), ]
ma_mort_ucar <- ma_mort_ucar[
  ma_mort_ucar$Year == 1979 & ma_mort_ucar$Sex == "Male",
] # filter dataset to only show 1979 data for men
Y <- ma_mort_ucar$Deaths
n <- ma_mort_ucar$Population
names(Y) <- names(n) <- unique(ma_mort_ucar$County.Code)
data <- list(Y = Y, n = n)
head(data$Y)
#> 25001 25003 25005 25007 25009 25011 
#>    57    63   191     1   239    27
head(data$n)
#> 25001 25003 25005 25007 25009 25011 
#> 21261 22465 71943  1361 98222  9669

Closing Thoughts

In this vignette, we used data generated from CDC WONDER to construct our event and population counts, remove unnecessary rows using filter(), construct arrays using array(), permute those arrays using aperm(), and named the arrays using dimnames(). Setting up the data for RSTr can seem daunting at first, but with a few quick tricks in R, it can be easy to have your data organized for analysis.