Acknowledgements

Several instructors have come before us and have made this work possible, those instructors include: Nat MacNell, Anne, among others.

All thanks to those above, but all errors are our own.

Overview of Homeworks 1-4

In this series of homeworks, we will analyze the effect of prenatal care (exposure) on preterm birth (outcome). To estimate this effect we will need to do the basics of any epidemiology project:

  1. Clean the data (i.e. “Datawrangling”" and making the data tidy)
  2. Descriptive Statistics
    1. Univariate Analyses
    2. Bivariate Analyses
    3. Data Visualization
    4. More data visualization
  3. Modeling

As an added bonus, these homeworks will also cover some basic spatial analyses and other “advanced techniques”.

Homework 1

Overview of Homework 1

The purpose of this homework is to introduce you to the under-the-hood machinery of R and provide some context/practice with data wrangling.

Question 1, Comments and code structure

  1. Now that you’ve got the code loaded, let’s think ahead and plan to write well-structured code. Show you can use these organization and commenting techniques throughout the following questions:
    1. Include a descriptive header and your research question for the overall analysis (put this as a #comment in your code header).
    2. Include “comment bars” for visual breaks.
    3. Include descriptive comments, both at the start of a line and after a line of runnable code.
    4. Include at least one comment with a relevant, clickable web reference
    5. Organize your code into collapsible code blocks
    6. Make your first code block loading basic libraries and setting your working directory with setwd().
# Please note this is an example of code chunks you may have at the top of a .R but would differ for a Rmarkdown file (which wraps the markdown language)

#............................
# Births 2012 Analysis for EPID 799C: PNC decrease PT birth? ####
# Mike Dolan Fliss, August 2017
# Aim: Does early prenatal care (<month 5) decrease the risk of preterm (<week 37) birth?
# (This Descriptive header is HW1.Q1a)
#............................ (<- example of HW1.Q1b, comment bar)
# Project Notes:
# Birth Certificate Revisions: https://www.cdc.gov/nchs/nvss/vital_certificate_revisions.htm (HW1.Q1d, clickable link)
#............................

#............................
# Libraries, directories, load data (HW1.Q1f) ####
#............................
# NOTE: May need to install these packages in advance on your local machine.
# install.packages("tableone")
library(tidyverse) # for ggplot, dplyr in HW1-4 (<- example of a post-line code comment, HW1.Q1c)
library(lubridate) # for dates in HW2
devtools::install_github("kaz-yos/tableone") # used in HW2, need for github for kableone functipn from PR #32 (https://github.com/kaz-yos/tableone/pull/32)
library(tableone)
# Set directories. 
data_dir = paste0(getwd(), "/data")
output_dir = paste0(getwd(), "/data")
# map_dir = paste0(data_dir, "/GIS") # used later
# Note you'll need to set different ones with setwd()! Since we're building this as a github repo the wd defaults to script dir.
# More on git here: http://r-bio.github.io/intro-git-rstudio/ . There's also a datacamp on it.
#............................

Question 2, Read the file

  1. This assignment will use the dataset births.csv from Sakai.
    1. Read in the “full” births2012.csv file into a data.frame called “births,” as we have done in class. This time, use the “stringsAsFactors” parameter set to F or FALSE to avoid R automatically coding character strings as factors. (In practice you will likely use tidyverse::read_csv but in this case use base::read.csv)
    2. Print out the first few rows. What function did you use? (5 points)
    3. Optional Challenge: Read Strings as Factors: An Authorized Biography and StringsAsFactors = sigh.
#............................
# Read data (HW1.Q2) ####
#............................
births = read.csv("data/births2012.csv", stringsAsFactors = F) # (HW1.Q2a)
names(births) = tolower(names(births)) #drop names to lowercase
head(births) # (HW1.Q2b)
##   x deltype bedcode stocc coocc cityocc stres cores cityresf        dob
## 1 1       1       K    NC    49   46340    NC    49    30120 2012-01-01
## 2 2       1       I    NC    47   73660    NC    47        0 2012-01-01
## 3 3       1       O    NC    81   31400    NC    81    31400 2012-01-01
## 4 4       1       O    NC    81   31400    NC    81    31400 2012-01-01
## 5 5       1       O    NC   127   57500    NC   127        0 2012-01-01
## 6 6       1       N    NC   179   43920    NC   179        0 2012-01-01
##   dobmonth dobday dobyear  tob sex plur sord liveb bwgrp mrace methnic
## 1        1      1    2012  621   1    1   99    99     6     1       N
## 2        1      1    2012 1659   1    1   99    99     6     1       N
## 3        1      1    2012 2211   1    1   99    99     6     2       N
## 4        1      1    2012 1144   2    1   99    99     7     2       N
## 5        1      1    2012  950   2    1   99    99     7     1       N
## 6        1      1    2012 1711   2    1   99    99     7     1       N
##   meduc mage mnat marital attend mtran visits mdif kessner kotel   zip wic
## 1     3   24   NC       1      1     2     15    2       1     4 28532   Y
## 2     2   25   FL       1      1     2     13    2       1     3 28472   N
## 3     2   29   IL       2      1     2      4    7       3     1 27262   Y
## 4     2   19   NC       2      1     2      6    6       2     1 27262   Y
## 5     1   25   NC       1      1     2     10    4       2     3 27822   Y
## 6     3   23   NC       1      1     2     15    4       2     4 28112   Y
##   totpreg lbliving lbdead term   dllb dllbmo dllbyear dtlterm dtlmo
## 1       2        1      0    0  92006      9     2006  888888    88
## 2       1        0      0    0 888888     88     8888  888888    88
## 3       6        3      0    2  42005      4     2005   62004     6
## 4       2        1      0    0  22010      2     2010  888888    88
## 5       4        3      0    0  82010      8     2010  888888    88
## 6       1        0      0    0 888888     88     8888  888888    88
##   dtlyear lastdelv cigbef cigdur pay pdiab gdiab phype ghype ppb ppo inft
## 1    8888        1      N      N   2     N     N     N     N   N   N    N
## 2    8888        9      Y      N   2     N     N     N     N   N   N    N
## 3    2004        1      N      N   1     N     N     Y     N   N   N    N
## 4    8888        1      N      N   1     N     N     N     N   N   N    N
## 5    8888        1      Y      Y   1     N     N     N     N   N   N    N
## 6    8888        9      N      N   1     N     N     N     N   N   N    N
##   psec npsec gon syph cham hepb hepc cerv toc ecvs ecvf prom pric prol
## 1    Y     1   N    N    N    N    N    N   N    N    N    Y    N    N
## 2    N     0   N    N    N    N    N    N   N    N    N    N    N    N
## 3    N     0   N    N    N    N    N    N   N    N    N    N    N    N
## 4    N     0   N    N    N    N    N    N   N    N    N    N    N    N
## 5    N     0   N    N    N    N    N    N   N    N    N    N    N    N
## 6    N     0   N    N    N    N    N    N   N    N    N    N    N    N
##   indl augl nvpr ster antb chor mecs fint esan attf attv pres rout tlab
## 1    N    N    N    N    Y    N    N    N    N    N    N    1    4    N
## 2    N    Y    N    N    N    N    N    N    Y    N    N    1    1    X
## 3    Y    N    N    N    N    N    N    Y    Y    N    N    1    1    X
## 4    N    N    N    N    N    N    N    N    Y    N    N    1    1    X
## 5    N    N    N    N    N    N    N    N    Y    N    N    1    1    X
## 6    N    N    N    N    N    N    N    N    N    N    N    1    1    X
##   mtr plac rut uhys aint uopr gest apgar5 apgar10 aven1 aven6 nicu surf
## 1   N    N   N    N    N    N   36      8      88     N     N    N    N
## 2   N    N   N    N    N    N   39      8      88     N     N    N    N
## 3   N    N   N    N    N    N   40      7      88     N     N    N    N
## 4   N    N   N    N    N    N   39      9      88     N     N    N    N
## 5   N    N   N    N    N    N   39      9      88     N     N    N    N
## 6   N    N   N    N    N    N   40      9      88     N     N    N    N
##   anti seiz binj anen mnsb cchd cdh omph gast limb cl cp dowt cdit hypo
## 1    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
## 2    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
## 3    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
## 4    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
## 5    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
## 6    N    N    N    N    N    N   N    N    N    N  N  N    N    N    N
##   itran bfed ehype inft_drg inft_art bmig wksgest
## 1     2    Y     N        X        X    3      38
## 2     2    Y     N        X        X    3      40
## 3     2    N     N        X        X    3      39
## 4     2    Y     N        X        X    3      37
## 5     2    Y     N        X        X    4      40
## 6     2    Y     N        X        X    4      40

Question 3, Basic Operations

  1. Basic operations on datasets
    1. Using all of these functions in in code: summary, head, tail, str, names, dim.
    2. Create a code chunk called “Exploring Data”
    3. Report the number of records and fields in the dataset.
    4. Report the five-number summary for the WKSGEST and MDIF variables.
    5. Pick three variables we will use for our analysis and report their type (numeric, character, etc.).
# ......................................
# Exploring Data - (HW1Q3) ####
# ......................................
dim(births) # (HW1.Q3b) 
## [1] 122513    117
summary(births[,c("wksgest", "mdif")]) # (HW1.Q3c) 
##     wksgest           mdif       
##  Min.   :18.00   Min.   : 1.000  
##  1st Qu.:38.00   1st Qu.: 2.000  
##  Median :39.00   Median : 3.000  
##  Mean   :38.83   Mean   : 6.183  
##  3rd Qu.:40.00   3rd Qu.: 4.000  
##  Max.   :99.00   Max.   :99.000
str(births[,c("mage", "wksgest", "mdif")]) # (HW1.Q3d) 
## 'data.frame':    122513 obs. of  3 variables:
##  $ mage   : int  24 25 29 19 25 23 20 22 36 32 ...
##  $ wksgest: int  38 40 39 37 40 40 37 39 42 40 ...
##  $ mdif   : int  2 2 7 6 4 4 4 2 1 2 ...
class(births$mage)
## [1] "integer"
class(births$wksgest)
## [1] "integer"
class(births$mdif) 
## [1] "integer"
str(births$mage)
##  int [1:122513] 24 25 29 19 25 23 20 22 36 32 ...
str(births$wksgest)
##  int [1:122513] 38 40 39 37 40 40 37 39 42 40 ...
str(births$mdif)
##  int [1:122513] 2 2 7 6 4 4 4 2 1 2 ...

Question 4, Subsetting & EDA

  1. Subsetting, loading packages & making graphics
    1. In the same “Exploring” code chunk, using the selection operators [], create a smaller version of the births file called births_sample with only the first 1000 rows and with only these variables:“MAGE”, “MDIF”, “VISITS”, “WKSGEST”, “MRACE”.
    2. Plot this smaller dataset all at once using the base plot(births_sample) function. Paste the plot in here. What does this plot show?
    3. Optional Challenges: Install and load the package “car”, and plot the smaller dataset using scatterplotMatrix(). If you are familiar with ggplot2, you might try the ggpairs() function in the GGally package. If you’re comfortable with functions, consider using the sample() function to get a random sample for births_sample, instead of the first 1000 rows.
# ......................................
# Exploring Data - (HW1Q4) ####
# ......................................
## (HW1.Q4) 
births_sample = births[1:1000, c("mage", "mdif", "visits", "wksgest", "mrace")] #A

plot(births_sample) #B

births_sample = births[sample(nrow(births), 1000), c("mage", "mdif", "visits", "wksgest", "mrace")] #C
# install.packages(c("car", "GGally"))
car::scatterplotMatrix(births_sample)

GGally::ggpairs(births_sample)

Question 5, Recoding

  1. Recode the following variables we plan to use in our main analysis. Note that we will learn faster ways to do many of these steps, but for now, use what you know.
    We’ll be coding integers, factors, and dates.
    Suggestion: start a new section called “Recoding.”.
    1. Prenatal Care
      1. Create a table of mdif and paste it here
      2. What do mdif 88 and mdif 99 mean?
      3. Assign 99 to R’s missing value (and 88, eventually, to no early PNC).
      4. Create a relevant univariate plot for mdif and paste it in.
      5. Create a numeric variable pnc5 that is a 1 if month prenatal care began is less than or equal to 5, 0 if later, and NA if missing.
      6. Create a factor variable pnc5_f based on pnc5 with labels “No Early PNC” and “Early PNC”
      7. Check your result with a bivariate table of pnc5_f and mdif (useNA=”always” will be relevant here). Paste it in here.
    2. Preterm Birth
      1. Create a histogram of wksgest and paste it here.
      2. Create an integer variable called preterm that is a 1 if wksgest is at least 20 weeks but less than 37 weeks, 0 if at least 37 weeks, and missing if missing.
      3. Create a factor variable preterm_f based on the preterm integer variable following the same coding as preterm, labeling the right levels as “term” and “preterm.”
      4. Optional Challenge: Create a factor variable preterm_f in one step, based on the original wksgest integer variable following the same coding as preterm using the cut() function, labeling the right levels as “term” and “preterm.”
    3. Plurality
      1. We’ll be using plurality as a selection critieria. Create a table of plur to confirm it is coded properly.
    4. Maternal Age
      1. Create a table of mage, and paste it here.
      2. What does mage =99 mean, specifically? See the code book.
      3. Assign mage =99 to R’s “missing” value.
      4. Create and paste in a univariate graph of mage. Consider using boxplot, density plot [hint: plot(density(x))] or histogram.
      5. Create a centered mage_centered variable in the dataset, equal to the mage minus the mean of mage. Report its fivenumber summary.
    5. Cigarette Use
      1. Recode the existing cigdur character variable to a new integer variable smoke so that it is coded as 1 for “Y”, 0 for “N”, and missing otherwise.
      2. Convert that integer variable smoke to a factor variable smoker_f with levels “Smoker” and “Non-smoker”
      3. Make sure your recoding is right by creating a two-way frequency table between the new variable smoke_f (row variable) and the old variable cigdur (column variable). Paste the output here.
    6. Date of Birth
      1. Find the lubridate package’s github page and skim it
      2. Run ?lubridate and vignette(“lubridate”) to explore the package
      3. Load the lubridate (or tidyverse!) package at the top of your script (suggestion: a chunk called “packages & data”)
      4. Use helpful functions like lubridate::ymd() to convert the dob string variable into a date-typed version called dob_d
    7. Sex
      1. Using skills from the previous questions, code the factor variable sex_f as “M”, “F”, or missing. Use a table to confirm your recoding.
    8. Maternal Race / Ethnicity (please see the homework for an important note/discussion on race-ethnicity!)
      1. Use a bivariate table() to investigate the covariate relationships of maternal race and ethnicity.
      2. Create a raceeth_f factor variable from mrace and methnic for these (limited!) categories: White non-Hispanic, White Hispanic, African American (any methnic value), American Indian or Alaska Native (any methnic value) and Other (including unknown or any missing values).
      3. Create one or more tables (note table(a,b,c) is valid syntax) to confirm coding of the race-ethnicity variable
## (HW1.Q5) 
### A. Prenatal Care (Exposure) 
#### i
table(births$mdif, useNA = "always") 
## 
##     1     2     3     4     5     6     7     8     9    88    99  <NA> 
##  9292 36987 41252 14783  7142  3984  2336  1556   971  2055  2155     0
#### ii
# See data dictionary, but 88 means no early prenatal care and 99 means missing

#### iii
births$mdif[births$mdif==99] <- NA 

#### iv
boxplot(births$mdif[births$mdif != 88], main="Month Prenatal Care Began, \n Excluding Unkowns and No Prenatal Care")

#### v
births$pnc5 <- ifelse(births$mdif <= 5, 1, 0) 

#### vi
births$pnc5_f <- factor(births$pnc5, levels=c(0,1), labels=c("No Early PNC", "Early PNC"))

#### vii
table(births$mdif, births$pnc5_f, useNA = "always")
##       
##        No Early PNC Early PNC  <NA>
##   1               0      9292     0
##   2               0     36987     0
##   3               0     41252     0
##   4               0     14783     0
##   5               0      7142     0
##   6            3984         0     0
##   7            2336         0     0
##   8            1556         0     0
##   9             971         0     0
##   88           2055         0     0
##   <NA>            0         0  2155
# (HW1.Q5 continued) 
### B. Preterm Birth (Outcome) 
#### i
hist(births$wksgest) #quick look

#### ii
births$preterm <- ifelse(births$wksgest >= 20 &births$wksgest < 37, 1,
                                     ifelse(births$wksgest >= 37 & births$wksgest < 99, 0, NA)
                        )

#### iii
births$preterm_f <- factor(births$preterm, levels=c(0,1), labels=c("term", "preterm"))

#### iv
births$preterm_f <- cut(births$wksgest, breaks = c(0,19,36,99), labels =  c("too young", "preterm", "term")) #should be no births too young at this point

# (HW1.Q5 continued) 
### C. Plurality (covariate) 
#### i
table(births$plur, useNA = "always") # looks fine, no 9 values for unknown
## 
##      1      2      3      4      5   <NA> 
## 118306   4080    118      4      5      0
# (HW1.Q5 continued) 
### D. Maternal Age (covariate) 
#### i
table(births$mage, useNA = "always") 
## 
##   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24 
##    1    1    5   23  102  355  846 1665 2877 4490 5539 6244 6564 6581 6533 
##   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39 
## 6625 6716 7084 6953 7005 6934 6670 6026 5361 4681 4034 3266 2642 2030 1570 
##   40   41   42   43   44   45   46   47   48   49   50   51   52   54   55 
## 1152  806  510  293  156   64   36   26    8    6    5    2    2    1    1 
##   99 <NA> 
##   22    0
#### ii
# See data dictionary, but 99 means misssing

#### iii
births$mage[births$mage==99] <- NA 

#### iv
par(mfrow=c(1,2))
boxplot(births$mage, main="Boxplot of Maternal Age Distribution")
plot(density(births$mage[!is.na(births$mage)]), main="Density Plot of Maternal Age Distribution")

dev.off() # turning back to default
## null device 
##           1
#### v
births$mage_centered <- births$mage - mean(births$mage, na.rm = T)
hist(births$mage) # looks to be expected
summary(births$mage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   23.00   27.00   27.56   32.00   55.00      22
# (HW1.Q5 continued) 
### E. Cigarette Use (covariate) 
#### i
table(births$cigdur, useNA = "always") # check to see what values are in cigdur
## 
##      N      U      Y   <NA> 
## 108952    610  12951      0
births$cigdur <- ifelse(births$cigdur == "Y", 1, 
                     ifelse(births$cigdur == "N", 0, NA)) # NA for when case is U - i.e. unknown

#### ii
births$smoker_f <- factor(births$cigdur, levels=c(0,1), labels=c("Non-smoker", "Smoker"))

#### ii
table(births$cigdur, births$smoker_f, useNA = "always")
##       
##        Non-smoker Smoker   <NA>
##   0        108952      0      0
##   1             0  12951      0
##   <NA>          0      0    610
# (HW1.Q5 continued) 
### F. Date of Birth (covariate) 
#### i
# https://github.com/tidyverse/lubridate
#### ii
?lubridate
# vignette("lubridate") 

#### iii
library(lubridate) # normally this would go at the top of the script under a packages chunk 

#### iv
births$dob_d <- lubridate::ymd(births$dob)

# (HW1.Q5 continued) 
### G. Sex (covariate) 
#### i
table(births$sex, useNA = "always") # check what initial values are
## 
##     1     2     9  <NA> 
## 62523 59988     2     0
births$sex[births$sex==9] <- NA
births$sex_f <- factor(births$sex, levels=c(1,2), labels=c("Male", "Female"))
table(births$sex_f, births$sex, useNA="always")
##         
##              1     2  <NA>
##   Male   62523     0     0
##   Female     0 59988     0
##   <NA>       0     0     2
# (HW1.Q5 continued) 
### G. Maternal Race/Ethnicity (covariate) 
#### i
table(births$mrace, births$methnic, useNA = "always") # check what initial values are
##       
##            N     U     Y  <NA>
##   1    68995    47  3026     0
##   2    28842    12   487     0
##   3     1645     0    81     0
##   4     4877    12 14489     0
##   <NA>     0     0     0     0
## ii
mergedf <- data.frame(
  mrace = c(sort(rep(seq(1,4),3))),
  methnic = as.character(rep(c("Y", "N", "U"), 4)),
  raceeth_f =factor(c("White Hispanic", "White non-Hispanic", "Other",
                      rep("African American", 3), 
                      rep("American Indian or Alaska Native", 3),
                      rep("Other", 3)
  )
  ), 
  stringsAsFactors = F)

births <- merge(x=births, y=mergedf, by=c("mrace", "methnic"))

#### iii
table(births$mrace, births$methnic, births$raceeth_f, useNA = "always") # looks good
## , ,  = African American
## 
##       
##            N     U     Y  <NA>
##   1        0     0     0     0
##   2    28842    12   487     0
##   3        0     0     0     0
##   4        0     0     0     0
##   <NA>     0     0     0     0
## 
## , ,  = American Indian or Alaska Native
## 
##       
##            N     U     Y  <NA>
##   1        0     0     0     0
##   2        0     0     0     0
##   3     1645     0    81     0
##   4        0     0     0     0
##   <NA>     0     0     0     0
## 
## , ,  = Other
## 
##       
##            N     U     Y  <NA>
##   1        0    47     0     0
##   2        0     0     0     0
##   3        0     0     0     0
##   4     4877    12 14489     0
##   <NA>     0     0     0     0
## 
## , ,  = White Hispanic
## 
##       
##            N     U     Y  <NA>
##   1        0     0  3026     0
##   2        0     0     0     0
##   3        0     0     0     0
##   4        0     0     0     0
##   <NA>     0     0     0     0
## 
## , ,  = White non-Hispanic
## 
##       
##            N     U     Y  <NA>
##   1    68995     0     0     0
##   2        0     0     0     0
##   3        0     0     0     0
##   4        0     0     0     0
##   <NA>     0     0     0     0
## 
## , ,  = NA
## 
##       
##            N     U     Y  <NA>
##   1        0     0     0     0
##   2        0     0     0     0
##   3        0     0     0     0
##   4        0     0     0     0
##   <NA>     0     0     0     0
# ......................................

Question 6, Epi Packages

  1. Optional Challenges: Useful early Packages
    1. Use the tableone:: package to print out a Table 1 of some of your variables of interest
    2. Use the mice:: package to print out misingness patterns
    3. Use ggplot2:: package with ggpairs() to print a plot matrix of a sample() (for time/size reasons) of your data Note, the code here provided without the output
# ......................................
# Optional data exploration (HW1.Q6) ####
# ......................................

# (HW1.Q6) 
## A
#CreateTableOne(vars = c("preterm_f"), strata = c("mage", "pnc5_f", "smoker_f", "sex_f"), data = births) 

## B
# mice::md.pattern(births) # the output here is a lot, so I have opted so comment it out for space concerns 


## C
# GGally::ggpairs(data = births[1:1000, c("pnc5_f", "preterm_f")], title="Exposure versue Outcome for First 1,00 Obserations")

Homework 2

Overview of Homework 2

The purpose of this homework is to familiarize you with the apply family, which are powerful tools that build on vectorization (which converts to speed in R).

Question 1, Create Dates

  1. Create useful calculated variables
    1. Week number: Create the weeknum variable in the births dataset using the week() function in lubridate and the date of birth (dob) variable in already in.
# ......................................
# Eligibility criteria & core recoding  (HW2 part 1)
# ......................................
# Create many inclusion variables, all of which have to be =1 to be included.
births$weeknum = week(births$dob_d) #thanks, lubridate! (HW2.1.Q1)

Question 2, Exclusion Criteria

  1. We will now exclude births from the study to simplify and hone our research question. Specifically, we’ll look at births that have data on our variables of interest, had all their weeks of gestation at risk of preterm birth within our study year, are single births, and have no congenital anomalies. In each case we’ll create a variable starting with “incl_” in the dataset so we can keep track of various criteria for study inclusion, with the variable = 1 meaning it passes that inclusion test, and 0 meaning it does not pass. We’ll use apply() functions where helpful to help reduce our code. The as.numeric() function may be helpful to cast Booleans to 1s and 0s, though T and F are interpreted intelligently as 1 and 0 as well – these instructions are based on 1s and 0s, but you may use T and F if that makes more sense to you.
    1. Has Gestation: Create incl_hasgest in the births dataset to be a 1 if the birth has wksgest data, 0 if missing.
    2. Enough gest: Create incl_enoughgest in the births dataset to be a 1 if the birth has >= 20 weeks of gestation, 0 if otherwise.
    3. Late enough: Create incl_lateenough to be 1 when wksgest-weeknum <= 19, 0 otherwise.
    4. Early enough: Create incl_earlyenough to be 1 when weeknum - wksgest <= 7, 0 otherwise.
    5. Singletons: Create incl_singleton to be 1 when plur is 1, 0 otherwise.
# a. Has gestation data (weeks), and at least 20 weeks gestation pre-birth (weeks >= 20)
births$incl_hasgest = as.numeric(!is.na(births$wksgest)) #(HW2.1.Q2a)
births$incl_enoughgest = as.numeric(births$wksgest >= 20) #(HW2.1.Q2b)
# b. Less than 20 weeks gestation before Jan 1 of year (LMP > Aug 20), or weeks-weeknum>=19, w/ weeknum=1 for Jan 1-7
births$incl_lateenough = as.numeric(births$wksgest - births$weeknum <= 19) #(HW2.1.Q2c)
# c. Start date of gestation was 45 (max gest in'11) w prior to Jan 1 of next year, so all births are observed in year
births$incl_earlyenough = as.numeric(births$weeknum - births$wksgest <=7) #(HW2.1.Q2d)
# d. singletons only
births$incl_singleton = as.numeric(births$plur == 1) #(HW2.Q2e)

Question 3, Create congenital anomaly variable using apply()

  1. Create congenital anomaly variable using apply() There is no single “has any congenital anomaly” variable, so we’ll need to create one. Some hints: Use the MARGIN parameter of apply() to decide between rows and columns (we’ll want rows). The dataframe to pass to apply() will need to be subsetted to just the congen_anom variables. It may be easiest to send that dataframe into apply() as all Booleans (e.g. df[subsetting-goes-here] ==”N”).
    1. Create name vector: Create a variable congen_anom (outside births) to be equal to a vector of the names of congenital anomaly variables. You would normally refer to the meta data, but to save you typing, here are those variables in an array: c("anen","mnsb","cchd", "cdh", "omph","gast", "limb", "cl","cp","dowt","cdit", "hypo")
    2. Check all anomalies at once: Use apply() with the all() function on the congenital anomaly columns of births to return a 1 if all congenital anomaly variables are “N”, 0 otherwise and assign it to variable incl_noanomalies.
# e. no congential anomalies - Let's use apply for efficiency
congen_anom = c("anen","mnsb","cchd", "cdh", "omph","gast", "limb", "cl","cp","dowt","cdit", "hypo") #(HW2.1.Q3a)
  # below is just a useful aside to check that there is No "U" (i.e. none missing) before running the code to make the new variable
  births$incl_hasanomdata = as.numeric(apply(births[,congen_anom]!="U", FUN=all, 1)) #All not missing = not any missing
births$incl_noanomalies = as.numeric(apply(births[,congen_anom]=="N", FUN=all, 1)) #All not present #(HW2.1.Q3b)

Question 4, Finish the inclusion criteria

  1. Finish the inclusion criteria
    1. Save your dataset, before we subset it, as old_births. If you need to in the future, you can birth=old_births to reset your work, since this dataset is relatively small (sometimes I’ll save these sorts of save points in a comment… ).
    2. Explore the grepl function. To take advantage of our incl_ naming convention, we’ll use a text search function, grepl. As an example, try (and note the misspelling of banana) : grepl("a", c("banana", "peach", "ornge"))
    3. (Optional Challenge): In one line of R, report the number that failed each inclusion test using apply() and sum over the column margins instead of rows. The grepl function may be useful to subset births to the inclusion criteria of interest.
    4. Create overall inclusion variable: Create a variable in our dataset include_allpass that represents a 1 if all inclusion criteria are passed. Using the grepl() function to find just the variables with incl_ in their name, subset births to be just those that have a 1 in every incl_ variable. This is similar to what’s done in 3c, above, but looks at rows instead of columns.
    5. Subset to create a new births dataset on the outcome of that apply function, just including births that pass the inclusion criteria.
    6. Summarize exclusion: Report the number of births before and after the inclusion criteria using nrow() or dim().
old_births <- births #(HW2.1.Q4a)

# Below is the example grepl (HW2.1.Q4b)
grepl("a", c("banana", "peach", "ornge")) 

[1] TRUE TRUE FALSE

eligibility_drops = nrow(births) - apply(births[,grepl("incl_", names(births))], FUN=sum, MARGIN = 2, na.rm=T) #(HW2.1.Q4c optional)
eligibility_drops #(HW2.1.Q4c optional)
incl_hasgest  incl_enoughgest  incl_lateenough incl_earlyenough 
           0               39            42317            15014 

incl_singleton incl_hasanomdata incl_noanomalies 4207 761 1051

births$include_allpass = apply(births[, grepl("incl_", names(births))], FUN=all, MARGIN = 1) #(HW2.1.Q4d indicator for the inclusion criteria yes/no)

births = births[births$include_allpass, ] #(HW2.1.Q4e now apply the inclusion criteria to subset your births data set)
  # to see what warnings are being thrown you can type warnings() -- below:  
  warnings() # Just letting me know I'm casting those 1s as TRUEs. That's ok.

dim(old_births) #(HW2.1.Q4f)

[1] 122513 134

dim(births) #(HW2.1.Q4f)

[1] 62370 135

Question 5, Optional Challenge

  1. Optional Challenge: Create an “any exclusion” variable in oldbirths (before we subset it). Use tableone::CreateTableOne grouped by that variable to describe the difference in the exclusion and inclusion cohorts.
#(HW2.1.Q5 Optional Challenge)
# Results of exclusion
cat("Leaving eligibility checks with n =", formatC(nrow(births), big.mark = ","), 
    "births of original", formatC(nrow(old_births), big.mark = ",")) #(HW2.1.Q4d)

Leaving eligibility checks with n = 62,370 births of original 122,513

# names(births)

vars_of_interest = c("pnc5_f", "preterm_f", "smoker_f", "wksgest", "sex", "meduc", "raceeth_f", "weeknum", "include_allpass")
tableone::kableone(tableone::CreateTableOne(data=births[, vars_of_interest])) #(HW2.1.Q5)
Overall
n 62370
pnc5_f = Early PNC (%) 56087 ( 91.2)
preterm_f (%)
too young 0 ( 0.0)
preterm 6114 ( 9.8)
term 56256 ( 90.2)
smoker_f = Smoker (%) 6642 ( 10.7)
wksgest (mean (sd)) 38.91 (2.44)
sex (mean (sd)) 1.49 (0.50)
meduc (mean (sd)) 2.79 (1.23)
raceeth_f (%)
African American 14840 ( 23.8)
American Indian or Alaska Native 884 ( 1.4)
Other 10013 ( 16.1)
White Hispanic 1581 ( 2.5)
White non-Hispanic 35052 ( 56.2)
weeknum (mean (sd)) 33.03 (8.11)
include_allpass = TRUE (%) 62370 (100.0)
  # let's look at who was included and excluded in our database 
  # This is the same as question 4 but now I am running it on the      birth dataset that has not been subsetted (i.e. we could have      already saved this variable earlier on or we can easily            recreate it)
old_births$include_allpass <- apply(old_births[, grepl("incl_", names(old_births))], FUN=all, MARGIN = 1)

tableone::kableone(tableone::CreateTableOne(data=old_births[, vars_of_interest], strata="include_allpass"))
FALSE TRUE p test
n 60143 62370
pnc5_f = Early PNC (%) 53369 (90.7) 56087 ( 91.2) 0.009
preterm_f (%) <0.001
too young 39 ( 0.1) 0 ( 0.0)
preterm 8041 (13.4) 6114 ( 9.8)
term 52063 (86.6) 56256 ( 90.2)
smoker_f = Smoker (%) 6309 (10.6) 6642 ( 10.7) 0.737
wksgest (mean (sd)) 38.76 (3.99) 38.91 (2.44) <0.001
sex (mean (sd)) 1.49 (0.50) 1.49 (0.50) 0.425
meduc (mean (sd)) 2.83 (1.33) 2.79 (1.23) <0.001
raceeth_f (%) 0.136
African American 14501 (24.1) 14840 ( 23.8)
American Indian or Alaska Native 842 ( 1.4) 884 ( 1.4)
Other 9412 (15.6) 10013 ( 16.1)
White Hispanic 1445 ( 2.4) 1581 ( 2.5)
White non-Hispanic 33943 (56.4) 35052 ( 56.2)
weeknum (mean (sd)) 20.68 (17.71) 33.03 (8.11) <0.001
include_allpass = TRUE (%) 0 ( 0.0) 62370 (100.0) <0.001
# ......................................
# NOTES: Finishes with 62,370 of 122,513 
# ......................................

Homework 3

Overview of Homework 3

In this section of the homework we will start with the tidyverse, and focus on the dplyr and ggplot2 packages. As discussed in class, there are powerful packages that make cleaning, analyzing, and graphing data “easy” (or at least tidy!).

Homework 3, Part 1

Question 1, Finish the inclusion criteria

Question 1A-C

  1. Mimic Graphs: Mimic these graphs as best you can using ggplot2 and paste them in below each graph. If they aren’t exact, that’s fine! Until we know how to use dplyr thoroughly to create summary data, you’ll likely need to use the ..prop and ..count hidden variables (as covered in ggplot2 class) to create these graphs. These are tough concepts; take your time and don’t be afraid to help each other!
    1. Figure 1A (distribution of gestational age)
    2. Figure 1B (weeknum distribution)
    3. Figure 1C (week vs. weeknum)
#(HW3.1.Q1A) Graph weeks
npop = formatC(sum(!is.na(births$wksgest)), big.mark=',') #(HW3.1.Q1a)
fig1a <- ggplot(data=births, aes(x=wksgest, y=..prop..))+geom_bar() + #Does what you need, as does geom_histogram()
  labs(title="Figure 1: Proportional distribution of gestational age at birth", 
       subtitle=paste0("From ", npop, " births in NC in 2012 from the NC SCHS, \n posted at Odum Institute"), 
       x="Weeks of gestation", y="Proportion of population") + 
  theme_bw()


  #(HW3.1.Q1a): Working with weeknum. - should double check this. Just used %V above.
#  hist(births$weeknum, breaks = min(births$weeknum):max(births$weeknum)) # Looks weird



#(HW3.1.Q1B) : Graph weeknum 
fig1b <- ggplot(births, aes(x=weeknum, y=..prop..)) + geom_bar() +
  labs(title="Throwaway: Investigating Weeknum", 
       subtitle="Note to self : Weeknum needs some work, \n doesn't quite match SAS", 
       x="Week number of birth", y="Proportion of population") +
  theme_bw()


#(HW3.1.Q1C) weeks vs. weeknum 
fig1c <- ggplot(births, aes(x=wksgest, y=weeknum)) + 
  geom_jitter(width=1, height=1, pch=".", alpha=0.3)+
  labs(title="Throwaway: week vs. weeknum", 
       subtitle="Note to self: Does this match what we expect given the inclusion criteria?",
       x="weeks of gestation", y="week number of birth in year")+
  theme_bw()

plot(gridExtra::arrangeGrob(fig1a, fig1b, fig1c, layout_matrix = rbind(c(1,2), c(3,3))))

Question 1D

4.  Change a graph from questions a, b or c using ggplot’s “+” syntax to add a layer. You can do anything you like, but as a suggestion, try taking a saved plot c and adding geom_bin2d(binwidth=1, alpha=0.8).  
#(HW3.1.Q1D)
# Note here that you can ggplot objects that are "appendable" 
# which is to say you can add on new features by adding on LAYERS 
plotObj <- ggplot(births, aes(x=wksgest, y=weeknum)) + 
  geom_jitter(width=1, height=1, pch=".", alpha=0.3)+
  labs(title="Throwaway: week vs. weeknum", 
       subtitle="Note to self: Does this match what we expect given the inclusion criteria?", 
       x="weeks of gestation", y="week number of birth in year")+
  theme_bw()

plot(plotObj)

## append new feature
plotObj + geom_bin2d(binwidth=1, alpha=0.8) + labs(caption = "Look we appended a layer here!")

Question 1E

5. Create your own! Create and submit any ggplot2 graphic you like built from the births dataset and using our variables of interest. Use at least two geometry layers, a title, a subtitle, and a caption that describes your interpretation of the punchline of your graph. 
#install.packages("ggridges")
library("ggridges")
ggplot() + 
  geom_density_ridges_gradient(data=births, aes(x=mage, y=factor(cut(wksgest, 10)), fill=..x..)) + 
  scale_fill_gradientn(colors=c("#fc8d59", "#ffffbf", "#91cf60")) + 
  facet_grid(meduc~raceeth_f) +
  ggtitle("Distribution of Weeks Gestation by Birth Outcome") + 
  labs(subtitle="This is the continuous variable we dichotomized") + 
  ylab("Weeks of Gestations factorized") + xlab("mage") + 
  theme_bw()

Homework 3, Part 2

This section of the homework will focus a bit more on “piping” with magrittr and using dplyr.

Question 2, Summary Table for mage

Create a summary table for maternal age: Using dplyr, create a summary table for mage from the births dataset whose head looks like below:
Investigate why maternal age of 54 gives a NaN for the newly created pct_earlyPNC variable (hint – what would the mean of all missing be?).

# ......................................
# Functional Form of maternal age w/ dplyr and ggplot2 #(HW3 part 2)
# ......................................

mage_df = births %>% group_by(mage) %>% #(HW3.2.2) setup 
  summarize(n=n(), 
            pct_earlyPNC = mean(pnc5, na.rm=T),
            pct_preterm = mean(preterm, na.rm=T)) %>% 
  mutate_if(is.numeric, round, 2)

DT::datatable(mage_df) #(HW3.2.2) answer 

Question 3, Explore functional form

  1. Create a ggplot similar to the below plot that demonstrates the functional form of maternal age to the risk of preterm birth. Note that the “weight” aesthetic will be important so that the larger center bins appropriately pull most of the quick loess model.
  2. Optional Challenge: Consider the “on the fly” linear square term model as a functional form – if you want to attempt it, check the help page for geom_smooth, looking at the method and formula parameters. Also check out UCLA’s IDRE help pages.
#(HW3.2.3A)  
fig2a <- ggplot(mage_df, aes(mage, pct_preterm))+ 
  geom_point(aes(size=n))+
  geom_smooth(aes(weight=n), color="blue", method="loess")+
  labs(title="% Preterm vs. Maternal Age", 
       x="maternal age", y="% preterm", 
       subtitle="Investigating function form of maternal age", 
       caption="Note for future glms: Seems quadratic. Blue is loess, red is square linear.")

#(HW3.2.3B)  Optional Challenge  
fig2b <- ggplot(mage_df, aes(mage, pct_preterm)) + 
  geom_point(aes(size=n)) +
  geom_smooth(aes(weight=n), color="blue") +
  geom_smooth(aes(weight=n), method="lm", formula=y ~ poly(x, 2), color="red", lty=2) + # this is the square error term
  labs(title="% Preterm vs. Maternal Age", 
       x="maternal age", y="% preterm", 
       subtitle="Investigating function form of maternal age", 
       caption="Note for future glms: Seems quadratic. Blue is loess, red is square linear.")



plot(gridExtra::arrangeGrob(fig2a, fig2b))

Homework 3, Part 3